Today, we are going to check how well we can predict the chlorophyll content and the plant height of spring barley genotypes. We will apply different models and walk through the code needed to get what we want - good predictions from (genomic) models.
geno = read.table("data/intro/genotypes_GPW_match.txt", na.strings = c("NA", "./."))
head(geno)
## G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13 G14 G15 G16 G17 G18 G19
## chr1H_145558 CC CC CC GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG
## chr1H_146097 TT TT TT TT TT TT CC TT CC TT TT TT TT TT CC TT TT TT CC
## chr1H_146155 GG GG GG GG GG GG AA GG AA GG GG GG GG GG AA GG GG GG AA
## chr1H_158861 CC CC CC CC CC CC AA CC AA CC CC AA CC AA AA CC AA CC AA
## chr1H_162979 GG TT TT TT GG TT GG TT GG TT TT GG TT TT GG GG TT TT GG
## chr1H_164176 TT TT TT TT GG TT GG GG GG TT GG GG GG TT GG TT TT GG GG
## G20 G21 G22 G23
## chr1H_145558 GG GG GG CC
## chr1H_146097 CC CC TT TT
## chr1H_146155 AA AA GG GG
## chr1H_158861 AA AA CC CC
## chr1H_162979 GG TT TT TT
## chr1H_164176 GG TT TT TT
This is a pre-processed VCF file: the genotype information was extracted from the original file and the alleles were already converted to base pairs. Typically, alleles are coded 0/0, 0/1 & 1/1 (with ./. indicating missing information):
0 refers to the reference allele (the base present in the reference genome)
1, analogous to this, refers to the alternative allele
In our case, we received the information already converted back to base pairs.
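As a side note, if the genotypes were still in the 0/0-style coding, the conversion to numeric values could be sketched like this (a minimal example with made-up calls, not taken from our file):

```r
# hypothetical VCF-style genotype calls (for illustration only)
calls <- c("0/0", "0/1", "1/1", "./.", "0/0")
# map: homozygous reference -> -1, heterozygous -> 0, homozygous alternative -> 1
lookup <- c("0/0" = -1, "0/1" = 0, "1/1" = 1)
num <- unname(lookup[calls])  # "./." is not in the lookup and becomes NA
num                           # -1  0  1 NA -1
```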
Let's have a first look at the data - it would be interesting to know how many markers and genotypes we have.
dim(geno)
## [1] 21298 23
So, we have ~21,000 markers and 23 genotypes.
All genomic prediction models require numeric values to perform the calculations, typically coded as -1, 0, 1 for the three genotype classes (the two homozygous and the heterozygous).
Let’s write a simple code chunk to quickly do this.
We always want the major allele to become the reference allele, so we will quickly generate a "reference" vector in which all major alleles are stored.
NOTE - later on, we will use the same reference to convert the marker file from the progenies as well.
# to get the major allele for each marker, we need to transpose the DF
geno2 = as.data.frame(t(geno))
# get the major allele
ref = apply(geno2, 2, function(x) names(which(table(x) == max(table(x))))) # we have no heterozygous genotypes or missing values here
ref = as.data.frame(ref)
# reverse the transposing to speed this up in the next step
geno2 = as.data.frame(t(geno2))
# convert the coding to ref = -1 and alt = 1
for (i in c(1:ncol(geno2))){
geno2[geno2[,i] != ref,i] = 1
geno2[geno2[,i] == ref,i] = -1
}
# we go back to the markers sitting in the columns
geno2 = as.data.frame(t(geno2))
# convert the values of -1, 0, 1, currently stored as character, to an integer
for(i in c(1:ncol(geno2))){geno2[,i] = as.integer(geno2[,i])}
# have a look
geno2[1:10,1:10]
## chr1H_145558 chr1H_146097 chr1H_146155 chr1H_158861 chr1H_162979
## G1 1 -1 -1 -1 1
## G2 1 -1 -1 -1 -1
## G3 1 -1 -1 -1 -1
## G4 -1 -1 -1 -1 -1
## G5 -1 -1 -1 -1 1
## G6 -1 -1 -1 -1 -1
## G7 -1 1 1 1 1
## G8 -1 -1 -1 -1 -1
## G9 -1 1 1 1 1
## G10 -1 -1 -1 -1 -1
## chr1H_164176 chr1H_164469 chr1H_167034 chr1H_167678 chr1H_173286
## G1 -1 1 1 -1 -1
## G2 -1 -1 1 -1 -1
## G3 -1 1 -1 -1 -1
## G4 -1 1 -1 1 -1
## G5 1 1 -1 -1 1
## G6 -1 1 -1 1 -1
## G7 1 -1 1 -1 -1
## G8 1 -1 -1 -1 1
## G9 1 -1 1 -1 -1
## G10 -1 1 -1 1 -1
Now that the genotype file is prepared, we go ahead and have a quick look at the corresponding phenotype file.
# loading the file
pheno = read.csv("./data/intro/Adjusted_means_daywise_dev.csv")
pheno$Time = as.Date(pheno$time, format = "%d.%m.%Y")
# extracting relevant information from each column to print it to the screen
env = unique(pheno$Environment)
gen = unique(pheno$Genotype)
time = unique(pheno$time)
p1 = summary(pheno$Plant.height.)
p2 = summary(pheno$Relative.Chlorophyll)
out = list(env, gen, time, p1, p2)
out
## [[1]]
## [1] "16h" "8h"
##
## [[2]]
## [1] "G1" "G2" "G3" "G4" "G5" "G6" "G7" "G8" "G11" "G9" "G12" "G13"
## [13] "G14" "G15" "G16" "G17" "G18" "G10" "G19" "G20" "G21" "G22" "G23"
##
## [[3]]
## [1] "01.04.2021" "02.04.2021" "04.04.2021" "05.04.2021" "06.04.2021"
## [6] "07.04.2021" "08.04.2021" "09.04.2021" "11.04.2021" "12.04.2021"
## [11] "13.04.2021" "14.04.2021" "15.04.2021" "17.04.2021" "18.04.2021"
## [16] "30.03.2021" "31.03.2021"
##
## [[4]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.135 23.378 29.324 28.673 34.221 46.844
##
## [[5]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.28 36.98 41.32 42.39 48.47 55.92
So to conclude: we have two phenotypes (chlorophyll content, plant height), 23 genotypes, 2 environments & 17 days on which measurements were performed.
As everything is prepared and ready to be analysed, we continue with the next section & our first approach: modelling the phenotype without genomic data.
We start with a BLUP-like estimation of the phenotypes, where we do not include the genomic data for now.
To do so, we can use simple linear models, or mixed linear models with fixed and random factors. The goal is to obtain a predicted phenotype score and check how well it performs via correlations.
# construct the model
model = lm(Relative.Chlorophyll ~ Genotype + Environment + time, data=pheno)
# just quickly check the anova
anova(model) # all factors have a significant effect
## Analysis of Variance Table
##
## Response: Relative.Chlorophyll
## Df Sum Sq Mean Sq F value Pr(>F)
## Genotype 22 6432.6 292.4 122.787 < 2.2e-16 ***
## Environment 1 27885.4 27885.4 11710.215 < 2.2e-16 ***
## time 16 669.1 41.8 17.561 < 2.2e-16 ***
## Residuals 742 1766.9 2.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# use the model to predict phenotypes
calculated = predict(model)
cor(pheno$Relative.Chlorophyll, calculated)
## [1] 0.9756669
So this model already gives a pretty good fit for the data set we used. But so far, we only used the "training" set, so the model knew about all genotypes and their phenotypes. How does the accuracy change when the model only knows 50% of all entries?
# sample 50% of all traits
train = sample(c(1:nrow(pheno)), round(nrow(pheno)*0.5))
test = setdiff(c(1:nrow(pheno)), train)
# generate the two sub samples
pheno_train = pheno[train,]
pheno_test = pheno[test,]
# train the linear model with the training set
model1 = lm(Relative.Chlorophyll ~ Genotype + Environment + time, data=pheno_train)
# use the model to predict the accuracy of the estimated chlorophyll contents
p_train = predict(model1, pheno_train)
p_test = predict(model1, pheno_test)
# calculate the correlation scores
cor(pheno_train$Relative.Chlorophyll, p_train)
## [1] 0.9782183
cor(pheno_test$Relative.Chlorophyll, p_test)
## [1] 0.9700071
plot(pheno_test$Relative.Chlorophyll, p_test)
The correlations are remarkably high - what could be the reason?
We will make some changes to the model, converting environment and time to random factors - so the observed values are assumed to come from a normal distribution. In contrast to before, we will use a mixed model.
#install.packages("lme4")
library(lme4)
## Loading required package: Matrix
# sample 50% of all traits
train = sample(c(1:nrow(pheno)), round(nrow(pheno)*0.5))
test = setdiff(c(1:nrow(pheno)), train)
# generate the two sub samples
pheno_train = pheno[train,]
pheno_test = pheno[test,]
# train the linear model with the training set
model2 = lmer(Relative.Chlorophyll ~ Genotype + (1|Environment) + (1|time), data=pheno_train)
# use the model to predict the accuracy of the estimated chlorophyll contents
p_train = predict(model2, pheno_train)
p_test = predict(model2, pheno_test)
# calculate the correlation scores
cor(pheno_train$Relative.Chlorophyll, p_train)
## [1] 0.9771261
cor(pheno_test$Relative.Chlorophyll, p_test)
## [1] 0.9722222
plot(pheno_test$Relative.Chlorophyll, p_test)
We can also go a step further and set the genotype to a random
factor.
# sample 50% of all traits
train = sample(c(1:nrow(pheno)), round(nrow(pheno)*0.5))
test = setdiff(c(1:nrow(pheno)), train)
# generate the two sub samples
pheno_train = pheno[train,]
pheno_test = pheno[test,]
# train the linear model with the training set
model3 = lmer(Relative.Chlorophyll ~ 1 + (1|Genotype) + (1|Environment) + (1|time), data=pheno_train)
# use the model to predict the accuracy of the estimated chlorophyll contents
p_train = predict(model3, pheno_train)
p_test = predict(model3, pheno_test)
# calculate the correlation scores
cor(pheno_train$Relative.Chlorophyll, p_train)
## [1] 0.9785198
cor(pheno_test$Relative.Chlorophyll, p_test)
## [1] 0.9693612
library(ggplot2)
pheno_test$estimate_test = p_test
ggplot(pheno_test, aes(x=Relative.Chlorophyll, y=estimate_test, shape = Environment, color = Time, size= time, label = Genotype)) +
geom_point() + #geom_text(hjust=0, vjust=0) +
ylab("Predicted") + xlab("Relative Chlorophyll content")
## Warning: Using size for a discrete variable is not advised.
What can be observed is that the robustness of the prediction even increases with random factors. Still, we have not performed any replications or cross-validations.
Let's perform some iterations of random sampling from the phenotypes.
i=0
sink = matrix(nrow=0, ncol=2) # a plain base-R matrix to collect the correlations
while (i < 50){
i=i+1
# sample 50% of all traits
train = sample(c(1:nrow(pheno)), round(nrow(pheno)*0.5))
test = setdiff(c(1:nrow(pheno)), train)
# generate the two sub samples
pheno_train = pheno[train,]
pheno_test = pheno[test,]
# train the linear model with the training set
model3 = lmer(Relative.Chlorophyll ~ 1 + (1|Genotype) + (1|Environment) + (1|time), data=pheno_train)
# use the model to predict the accuracy of the estimated chlorophyll contents
p_train = predict(model3, pheno_train)
p_test = predict(model3, pheno_test)
# calculate the correlation scores
a = cor(pheno_train$Relative.Chlorophyll, p_train)
b = cor(pheno_test$Relative.Chlorophyll, p_test)
sink = rbind(sink, c(a,b))
}
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00270526 (tol = 0.002, component 1)
mean(sink[,2])
## [1] 0.9718221
On average, across 50 replicated sampling steps with a 50/50 test/train split, we observe a correlation of 0.97.
What is the prediction accuracy when only considering the effects within one environment?
test = cbind(pheno_test, p_test)
test16 = test[test$Environment=="16h",]
cor(test16$Relative.Chlorophyll, test16$p_test)
## [1] 0.9321402
This value is lower, but still very high. Why is it that high? This is somewhat unexpected - it should be noticeably lower. The answer is that the chlorophyll values we use are adjusted means, based on the formula
lmer(Chlorophyll ~ Environment*Genotype + time + (1|Environment:Wiederholung), data=pheno).
So, all three of our predictors can be found in this equation, which results in a somewhat circular exercise:
adjust (missing) data points with a linear model that includes environment, time and genotype as variables => predict the phenotype of the adjusted means with these same three variables = (nearly) identical results.
It would be nice to see the correlation with the raw data, i.e. where the values have not been adjusted. This will give a much more realistic view of the precision.
# read in this data
raw = read.csv("/Data/michael/data/intro/unadjusted_parental_Chlorophyll_scores_dummy.csv")
# quick look into the data
dim(raw)
## [1] 4659 5
head(raw)
## Environment Genotype time Plant.height. Relative.Chlorophyll
## 1 16h G1 30.03.2021 18 44.50
## 2 16h G1 30.03.2021 17 46.33
## 3 16h G1 30.03.2021 17 52.16
## 4 16h G1 30.03.2021 13 14.83
## 5 16h G1 30.03.2021 13 34.72
## 6 16h G1 30.03.2021 16 40.08
# afterwards, we can simply predict the chlorophyll contents in this table using the predict function
m1 = predict(model1, raw)
m2 = predict(model2, raw)
m3 = predict(model3, raw)
par(mfrow = c(1, 3))
plot(m1, raw$Relative.Chlorophyll, main = cor(m1, raw$Relative.Chlorophyll, use = "complete"))
plot(m2, raw$Relative.Chlorophyll, main = cor(m2, raw$Relative.Chlorophyll, use = "complete"))
plot(m3, raw$Relative.Chlorophyll, main = cor(m3, raw$Relative.Chlorophyll, use = "complete"))
Aha. The predictions are actually much worse - and the figures also illustrate a high variance. For example, a predicted value of 35 is very likely to correspond to an observed value anywhere between 20 and 45. Obviously, we could guess whether a genotype has the potential for a low or high chlorophyll content, but we are far from being able to say genotype 1 is better than genotype 2. Still, these are not bad results, but we are looking for a better accuracy and prediction ability.
We go on with the genomics part, where we continue with the next section - rrBLUP.
The rrBLUP package allows us to calculate both a GWAS and a genomic prediction. For the genomic prediction, we use the mixed.solve function.
First, let's load the package into the R environment.
#install.packages("rrBLUP")
library(rrBLUP)
Now, let's have a look at what it does and what needs to be specified.
| mixed.solve | R Documentation |
Calculates maximum-likelihood (ML/REML) solutions for mixed models of the form
y = X \beta + Z u + \varepsilon
where \beta is a vector of fixed effects and u is a vector of random effects with
Var[u] = K \sigma^2_u. The residual variance is Var[\varepsilon] = I \sigma^2_e. This class
of mixed models, in which there is a single variance component other than the residual error,
has a close relationship with ridge regression (ridge parameter \lambda = \sigma_e^2 / \sigma^2_u).
mixed.solve(y, Z=NULL, K=NULL, X=NULL, method="REML",
bounds=c(1e-09, 1e+09), SE=FALSE, return.Hinv=FALSE)
y: Vector (n x 1) of observations.
Z: Design matrix (n x m) for the random effects. If not passed, assumed to be the identity matrix.
K: Covariance matrix (m x m) for the random effects. If not passed, assumed to be the identity matrix.
X: Design matrix (n x p) for the fixed effects. If not passed, a vector of 1's is used to model the intercept.
method: Specifies whether the full ("ML") or restricted ("REML") maximum-likelihood method is used.
bounds: Array with two elements specifying the lower and upper bound for the ridge parameter.
SE: If TRUE, standard errors are calculated.
return.Hinv: If TRUE, the function returns the inverse of H = Z K Z' + \lambda I.
This function can be used to predict marker effects or breeding values (see examples). The numerical method
is based on the spectral decomposition of Z K Z' and S Z K Z' S, where S = I - X (X' X)^{-1} X' is
the projection operator for the nullspace of X (Kang et al., 2008). This algorithm generates the inverse phenotypic covariance matrix V^{-1}, which can then be used to calculate the BLUE and BLUP solutions for the fixed and random effects, respectively, using standard formulas (Searle et al. 1992):
BLUE(\beta) = \beta^* = (X'V^{-1}X)^{-1}X'V^{-1}y
BLUP(u) = u^* = \sigma^2_u KZ'V^{-1}(y-X\beta^*)
The standard errors are calculated as the square root of the diagonal elements of the following matrices (Searle et al. 1992):
Var[\beta^*] = (X'V^{-1}X)^{-1}
Var[u^*-u] = K \sigma^2_u - \sigma^4_u KZ'V^{-1}ZK + \sigma^4_u KZ'V^{-1}XVar[\beta^*]X'V^{-1}ZK
For marker effects where K = I, the function will run faster if K is not passed than if the user passes the identity matrix.
If SE=FALSE, the function returns a list containing
Vu: estimator for \sigma^2_u
Ve: estimator for \sigma^2_e
beta: BLUE(\beta)
u: BLUP(u)
LL: maximized log-likelihood (full or restricted, depending on method)
If SE=TRUE, the list also contains
beta.SE: standard error for \beta
u.SE: standard error for u^*-u
If return.Hinv=TRUE, the list also contains
Hinv: the inverse of H
Kang et al. 2008. Efficient control of population structure in model organism association mapping. Genetics 178:1709-1723.
Endelman, J.B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255.
Searle, S.R., G. Casella and C.E. McCulloch. 1992. Variance Components. John Wiley, Hoboken.
#random population of 200 lines with 1000 markers
M <- matrix(rep(0,200*1000),200,1000)
for (i in 1:200) {
M[i,] <- ifelse(runif(1000)<0.5,-1,1)
}
#random phenotypes
u <- rnorm(1000)
g <- as.vector(crossprod(t(M),u))
h2 <- 0.5 #heritability
y <- g + rnorm(200,mean=0,sd=sqrt((1-h2)/h2*var(g)))
#predict marker effects
ans <- mixed.solve(y,Z=M) #By default K = I
accuracy <- cor(u,ans$u)
#predict breeding values
ans <- mixed.solve(y,K=A.mat(M))
accuracy <- cor(g,ans$u)
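The close relationship to ridge regression mentioned above can be illustrated on simulated data: plugging the estimated ridge parameter \lambda = Ve/Vu into the explicit ridge solution reproduces the BLUP marker effects. This is a sketch, not part of the original example:

```r
library(rrBLUP)
set.seed(42)
# small simulated population: 60 lines, 100 markers coded -1/1
M <- matrix(sample(c(-1, 1), 60 * 100, replace = TRUE), 60, 100)
y <- as.vector(M %*% rnorm(100)) + rnorm(60)
fit <- mixed.solve(y, Z = M)            # REML fit with K = I
lambda <- fit$Ve / fit$Vu               # estimated ridge parameter
# explicit ridge solution, given the BLUE of the intercept
u_ridge <- solve(crossprod(M) + lambda * diag(100),
                 crossprod(M, y - as.vector(fit$beta)))
max(abs(u_ridge - fit$u))               # essentially zero
```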
So let’s get started.
We will first select the relative chlorophyll contents at 01.04.2021 in the 16h environment.
Before that, we have to make sure the genotypes are in the same order in both files.
ALERT - make sure the order of the genotypes is identical in both the phenotype and the genotype files.
all(row.names(geno2) == unique(pheno$Genotype))
## [1] FALSE
We can see this is not yet the case, so we have to order the genotype file according to the phenotype file.
geno2 = geno2[order(match(row.names(geno2),unique(pheno$Genotype))),]
# check the result
all(row.names(geno2) == unique(pheno$Genotype))
## [1] TRUE
ph = pheno[pheno$Environment== "16h" & pheno$time=="01.04.2021",c(2,5)]
Great. So now, we can run the genomic prediction.
# impute the missing values by A.mat AND get the Kinship
# imputation is not required here, but for reference, this is the code
g2 = rrBLUP::A.mat(geno2, impute.method = "mean", return.imputed = T, n.core=10)
#geno2 = g2$imputed
g2$A[1:7,1:7] # this is the additive relationship matrix we need for the genomic blup
## G1 G2 G3 G4 G5 G6
## G1 2.0524596 0.200726781 -0.3957087 0.46262839 -0.146179182 0.0537840
## G2 0.2007268 1.956683313 -0.3677447 0.08619727 -0.006456693 -0.1422178
## G3 -0.3957087 -0.367744705 2.1735849 -0.28941425 -0.425155818 -0.3091237
## G4 0.4626284 0.086197271 -0.2894142 2.00382990 -0.267441130 0.2498443
## G5 -0.1461792 -0.006456693 -0.4251558 -0.26744113 1.857570112 -0.3742235
## G6 0.0537840 -0.142217776 -0.3091237 0.24984431 -0.374223458 2.0860437
## G7 -0.4157499 -0.228900360 0.2154991 -0.24257989 -0.293941569 -0.2640847
## G7
## G1 -0.4157499
## G2 -0.2289004
## G3 0.2154991
## G4 -0.2425799
## G5 -0.2939416
## G6 -0.2640847
## G7 1.9126785
# run the GBLUP
gb = mixed.solve(ph$Relative.Chlorophyll, K = g2$A)
# run the rrBLUP
rr = mixed.solve(ph$Relative.Chlorophyll, Z = geno2)
Let’s have a look at the results of the ridge regression BLUP - rrBLUP.
# have a look at the results - it is a list of different results
rr$beta
## [1] 48.27502
head(rr$u)
## chr1H_145558 chr1H_146097 chr1H_146155 chr1H_158861 chr1H_162979
## 0.0005658129 -0.0003726253 -0.0003726253 -0.0011230956 -0.0015781662
## chr1H_164176
## -0.0006665224
# extract the marker effects
u = as.vector(rr$u)
# use this to generate a genomic blup value for each genotype
blup = as.matrix(geno2) %*% u
# add the fixed y-axis intercept to the blup value, to obtain an actual estimate of the chlorophyll content
rr_estimate = as.vector(rr$beta) + blup
# compare to the real values we used to regress the model
cor(ph$Relative.Chlorophyll, rr_estimate)
## [,1]
## [1,] 0.9999072
# make a figure
plot(ph$Relative.Chlorophyll, rr_estimate)
Looks pretty good. But what about the genomic BLUP results in gb? Please note - here we do not get marker effects u that we need to matrix-multiply with the marker matrix Z. Instead, we directly obtain the finished breeding values Zu for each genotype.
blup = gb$u
# add the fixed y-axis intercept to the blup value, to obtain an actual estimate of the chlorophyll content
gb_estimate = as.vector(gb$beta) + blup
# compare to the real values we used to regress the model
cor(ph$Relative.Chlorophyll, gb_estimate)
## [1] 0.9999072
# make a figure
plot(ph$Relative.Chlorophyll, gb_estimate)
We can see that there is no difference in precision between rrBLUP and GBLUP. Both give a very high prediction ability, which means the fit of the model is really good.
To summarize: using all ~21,000 markers was as effective as using just the covariance matrix of the genotypes (which is a 23x23 matrix).
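This equivalence can also be checked on simulated data: the breeding values reconstructed from the marker effects and the breeding values returned by the kinship-based model are almost perfectly correlated. A sketch, analogous to the calls used above:

```r
library(rrBLUP)
set.seed(7)
# simulated population: 50 lines, 300 markers coded -1/1
M <- matrix(sample(c(-1, 1), 50 * 300, replace = TRUE), 50, 300)
y <- as.vector(M %*% rnorm(300)) + rnorm(50, sd = 2)
rr <- mixed.solve(y, Z = M)          # marker effects u
gb <- mixed.solve(y, K = A.mat(M))   # breeding values per line
# multiply markers by effects to get breeding values from the rrBLUP fit
cor(as.vector(M %*% rr$u), gb$u)     # close to 1
```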
So far, we have only tried to predict one environment on one day of all available days. Can we also generate a prediction of the phenotypes when we include all environments and dates? Of course we can. Here is the code to do it:
# repeat Z 34 times and sort it by row.names - we have 2 environments and 17 days - so 2 * 17 = 34
geno3 = do.call(rbind, replicate(34, geno2, simplify=FALSE))
geno3 = geno3[order(row.names(geno3)),]
# we want to include the factors Environment and time in the prediction as well.
# These can only be included as integer or numeric values, so we need to convert them.
pheno$s1 = as.integer(as.factor(pheno$Environment))
pheno$s2 = as.integer(as.factor(pheno$Time))
# rrBLUP model
se = mixed.solve(pheno$Relative.Chlorophyll, Z = geno3, X = pheno[,c(7,8)])
# Check out the results
se$beta # the 8h environment has a ~13.5 lower score than the 16h env
## s1 s2
## -13.463041 0.119687
head(se$u)
## chr1H_145558 chr1H_146097 chr1H_146155 chr1H_158861 chr1H_162979 chr1H_164176
## -0.009922187 -0.008257496 -0.008257496 -0.003045178 -0.004177840 -0.003731731
In contrast to before, we now obtain two fixed beta coefficients, one for each of our fixed effects environment and date. Let's quickly check out how they are coded.
unique(pheno[,c(1,6)])
## Environment Time
## 1 16h 2021-04-01
## 2 16h 2021-04-02
## 3 16h 2021-04-04
## 4 16h 2021-04-05
## 5 16h 2021-04-06
## 6 16h 2021-04-07
## 7 16h 2021-04-08
## 8 16h 2021-04-09
## 9 16h 2021-04-11
## 10 16h 2021-04-12
## 11 16h 2021-04-13
## 12 16h 2021-04-14
## 13 16h 2021-04-15
## 14 16h 2021-04-17
## 15 16h 2021-04-18
## 16 16h 2021-03-30
## 17 16h 2021-03-31
## 392 8h 2021-04-01
## 393 8h 2021-04-02
## 394 8h 2021-04-04
## 395 8h 2021-04-05
## 396 8h 2021-04-06
## 397 8h 2021-04-07
## 398 8h 2021-04-08
## 399 8h 2021-04-09
## 400 8h 2021-04-11
## 401 8h 2021-04-12
## 402 8h 2021-04-13
## 403 8h 2021-04-14
## 404 8h 2021-04-15
## 405 8h 2021-04-17
## 406 8h 2021-04-18
## 407 8h 2021-03-30
## 408 8h 2021-03-31
unique(pheno[,c(3,7)])
## time s1
## 1 01.04.2021 1
## 2 02.04.2021 1
## 3 04.04.2021 1
## 4 05.04.2021 1
## 5 06.04.2021 1
## 6 07.04.2021 1
## 7 08.04.2021 1
## 8 09.04.2021 1
## 9 11.04.2021 1
## 10 12.04.2021 1
## 11 13.04.2021 1
## 12 14.04.2021 1
## 13 15.04.2021 1
## 14 17.04.2021 1
## 15 18.04.2021 1
## 16 30.03.2021 1
## 17 31.03.2021 1
## 392 01.04.2021 2
## 393 02.04.2021 2
## 394 04.04.2021 2
## 395 05.04.2021 2
## 396 06.04.2021 2
## 397 07.04.2021 2
## 398 08.04.2021 2
## 399 09.04.2021 2
## 400 11.04.2021 2
## 401 12.04.2021 2
## 402 13.04.2021 2
## 403 14.04.2021 2
## 404 15.04.2021 2
## 405 17.04.2021 2
## 406 18.04.2021 2
## 407 30.03.2021 2
## 408 31.03.2021 2
So, we recalculate the chlorophyll estimates from the rrBLUP output and compare them to the real measurements. Let's see if the correlation between predicted and real values is still as high as before.
# extract the marker effects as before
u = as.vector(se$u)
# use this to generate a genomic blup value for each genotype
blup = as.matrix(geno3) %*% u
# construct an estimate for the environment and the time as well
env = pheno[,7] * se$beta[1]
time = pheno[,8] * se$beta[2]
# add the fixed effects environment and time to the blup value, to obtain an actual estimate of the chlorophyll content
se_estimate = env + time + as.vector(blup)
# compare to the real values we used to regress the model
cor(pheno$Relative.Chlorophyll, se_estimate)
## [1] 0.9274582
# make a figure
plot(pheno$Relative.Chlorophyll, se_estimate)
The fit is still pretty good, but the bias has increased. At least the model has not done any major underfitting. But what about overfitting? We have to make sure the predictions also fit data which was not shown to the algorithm. So let's perform a 10-fold cross-validation and check the correlations for those entries which have not been shown to the model.
# label each entry in the pheno file as one of 10 folds
pheno$CrossFold = sample(factor(rep(1:10, length.out=nrow(pheno)), labels=paste0("CF", 1:10)))
# generate a matrix where we store the correlations for each fold
GP = data.frame(matrix(ncol=3, nrow=0)) # check the test and train correlation separately
# we run a loop across all 10 folds
for(i in unique(pheno$CrossFold)){
# split the pheno in a test and a training set
train_pheno = pheno[pheno$CrossFold != i,]
test_pheno = pheno[pheno$CrossFold == i,]
# split the genotypes in a test and training data set
train_geno = geno3[pheno$CrossFold != i,]
test_geno = geno3[pheno$CrossFold == i,]
# run the genomic prediction with the ridge regression method
SE = mixed.solve(train_pheno$Relative.Chlorophyll, Z = train_geno, X = train_pheno[,c(7,8)])
# extract the marker effects as before
u = as.vector(SE$u) # similar for the test and training set
# use this to generate a genomic blup value for each genotype
blup_test = as.matrix(test_geno) %*% u
blup_train = as.matrix(train_geno) %*% u
# construct an estimate for the environment and the time as well
env_train = train_pheno[,7] * SE$beta[1]
time_train = train_pheno[,8] * SE$beta[2]
env_test = test_pheno[,7] * SE$beta[1]
time_test = test_pheno[,8] * SE$beta[2]
# add the fixed effects environment and time to the blup value, to obtain an actual estimate of the chlorophyll content
se_estimate_train = env_train + time_train + as.vector(blup_train)
se_estimate_test = env_test + time_test + as.vector(blup_test)
# compare to the real values we used to regress the model
a = cor(train_pheno$Relative.Chlorophyll, se_estimate_train)
b = cor(test_pheno$Relative.Chlorophyll, se_estimate_test)
GP = rbind(GP,c(a,b, i))
}
colnames(GP) = c("Train", "Test", "Fold")
GP = GP[order(GP$Fold),]
GP$Train = round(as.numeric(GP$Train), digits = 3)
GP$Test = round(as.numeric(GP$Test), digits = 3)
GP
## Train Test Fold
## 10 0.925 0.945 CF1
## 1 0.928 0.923 CF10
## 9 0.927 0.929 CF2
## 2 0.928 0.916 CF3
## 8 0.928 0.921 CF4
## 3 0.929 0.907 CF5
## 5 0.929 0.915 CF6
## 7 0.929 0.911 CF7
## 6 0.927 0.925 CF8
## 4 0.927 0.934 CF9
We want to extract a mean correlation across all cross-folds. So we use the formula presented in the lecture.
sum(GP$Train) / nrow(GP) # the training data
## [1] 0.9277
sum(GP$Test) / nrow(GP) # the testing data
## [1] 0.9226
Let's make a fancy plot where time and environment are colored (for the last cross-fold).
test_pheno$prediction = se_estimate_test
library(ggplot2)
ggplot(test_pheno, aes(x=Relative.Chlorophyll, y=se_estimate_test, shape = Environment, color = Time, linewidth= Time, label = Genotype)) +
geom_point() + geom_smooth() +
ylab("Predicted") + xlab("Relative Chlorophyll content")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation:
## colour, linewidth, label
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
What we can see from this plot is that the ridge regression can nicely predict the genotypes' chlorophyll contents on all dates and in both environments. A score of ~0.92 is impressive and indicates that we have neither major over- nor underfitting. Anyway, we have to be a bit careful with the interpretation: on closer inspection, much of this correlation is driven by the large environment and time effects, so the partial correlations within a single environment will be considerably lower than the ~0.92 mentioned before.
To test these subsamples, we will simply split the correlations into an 8h and a 16h environment and analyse them separately. We will use the result from the last cross-validation performed before.
t16 = test_pheno[test_pheno$Environment == "16h",]
t8 = test_pheno[test_pheno$Environment == "8h",]
cor(t8$Relative.Chlorophyll, t8$prediction)
## [1] 0.6916691
cor(t16$Relative.Chlorophyll, t16$prediction)
## [1] 0.6269762
So what is the take-home from this? We see clearly lower correlations of 0.69 & 0.63 within the environments; in this particular split, the 8h environment happens to have the better prediction ability. Although not as high as before, the values are still considered good.
Another interesting aspect could be to test whether we can correctly predict the values for the last 3 days if we only train on the earlier dates. If so, this could allow us to make assumptions on plant performance beyond the collected time period.
# convert time to day of year
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
pheno$DOY = yday(pheno$Time)
# exclude the three highest values
sort(unique(pheno$DOY)) # 105 107 108
## [1] 89 90 91 92 94 95 96 97 98 99 101 102 103 104 105 107 108
# extract the test & training data
pheno_train = pheno[pheno$DOY < 105,]
pheno_test = pheno[pheno$DOY > 104,]
# same on the geno data
geno_train = geno3[pheno$DOY < 105,]
geno_test = geno3[pheno$DOY > 104,]
# run the genomic prediction on the train set
SE = mixed.solve(pheno_train$Relative.Chlorophyll, Z = geno_train, X = pheno_train[,c(7,8)])
# extract the marker effects as before
u = as.vector(SE$u) # similar for the test and training set
# use this to generate a genomic blup value for each genotype
blup_test = as.matrix(geno_test) %*% u
blup_train = as.matrix(geno_train) %*% u
# construct an estimate for the environment and the time as well
env_train = pheno_train[,7] * SE$beta[1]
time_train = pheno_train[,8] * SE$beta[2]
env_test = pheno_test[,7] * SE$beta[1]
time_test = pheno_test[,8] * SE$beta[2]
# add the fixed effects environment and time to the blup value, to obtain an actual estimate of the chlorophyll content
se_estimate_train = env_train + time_train + as.vector(blup_train)
se_estimate_test = env_test + time_test + as.vector(blup_test)
# compare to the real values we used to regress the model
cor(pheno_train$Relative.Chlorophyll, se_estimate_train)
## [1] 0.9303906
cor(pheno_test$Relative.Chlorophyll, se_estimate_test)
## [1] 0.9090312
ggplot(pheno_test, aes(x=Relative.Chlorophyll, y=se_estimate_test, shape = Environment, color = Time, size= Time, label = Genotype)) + geom_point() + ylab("Predicted") + xlab("Relative Chlorophyll content")
We can see a low bias & low variance for this prediction model as well, so we can say - yes, the first couple of days can be used to predict the later ones.
We have to keep in mind that we cannot simply transfer the Zu predictions of the GBLUP model from a training to a testing set. It is possible with some extra work, but we will not do it for now.
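For completeness, here is a sketch of how such a transfer could look. Breeding values of unobserved lines can be obtained from a trained GBLUP via the relationship between test and training lines; the projection A[test, train] %*% solve(A[train, train]) %*% u_train is a standard approach, shown here on simulated data (not code from this tutorial):

```r
library(rrBLUP)
set.seed(11)
# simulated population: 40 lines, 500 markers coded -1/1
M <- matrix(sample(c(-1, 1), 40 * 500, replace = TRUE), 40, 500)
y <- as.vector(M %*% rnorm(500)) + rnorm(40, sd = 2)
A <- A.mat(M)
train <- 1:30
test  <- 31:40
# fit the GBLUP on the training lines only
gb <- mixed.solve(y[train], K = A[train, train])
# project the breeding values onto the unobserved lines
u_test <- A[test, train] %*% solve(A[train, train]) %*% gb$u
# accuracy on lines the model has never seen
# (will be low here, since the randomly simulated lines are unrelated)
cor(y[test], as.vector(u_test))
```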
# calculate the Kinship for the test set
K = rrBLUP::A.mat(geno_train)
# genomic prediction using a genomic blup model
gb = mixed.solve(pheno_train$Relative.Chlorophyll, K = K, X = pheno_train[,c(7,8)])
blup = gb$u
# NOTE, as we do not get any marker effects values, we have to take the output as it came from the GBLUP for each genotype
# add the fixed y-axis intercept to the blup value, to obtain an actual estimate of the chlorophyll content
gb_estimate = as.vector(gb$beta) + blup
# construct an estimate for the environment and the time as well
env_train = pheno_train[,7] * gb$beta[1]
time_train = pheno_train[,8] * gb$beta[2]
# add the fixed effects environment and time to the blup value, to obtain an actual estimate of the chlorophyll content
se_estimate_train = env_train + time_train + as.vector(gb_estimate)
# compare to the real values we used to regress the model
cor(pheno_train$Relative.Chlorophyll, se_estimate_train)
## [1] 0.3415699
ggplot(pheno_train, aes(x=Relative.Chlorophyll, y=se_estimate_train, shape = Environment, color = Time, size= Time, label = Genotype)) +
geom_point() + ylab("Predicted") + xlab("Relative Chlorophyll content")
Phew, this actually does not look so nice. A training-set correlation of 0.34 indicates a clear underfit.
Let's have a quick look at the figure regarding underfitting and look up what might be the problem.
(Figure: the model-fitting sweet spot between under- and overfitting)
The figure also indicates that the GBLUP model cannot handle the additional complexity of environments and time we just added.
So we can summarize: the ridge regression outperformed the GBLUP model in terms of prediction ability when we increased the complexity of the model by adding additional fixed factors.
The sommer package (Solving Mixed Model Equations in R) can be applied in a much broader sense.
Besides a ridge regression mixed model, it allows you to run a genomic BLUP (optionally with multiple covariance matrices). It can calculate the genomic heritability and also allows you to use non-genomic data with fixed effects, random effects, and an included covariance structure. It is a very versatile package which can be seen as an analogue to lme4 for all sorts of mixed models where a covariance structure is useful.
For us, this tool can do everything we previously did with rrBLUP, but with improved flexibility (and also more complexity in terms of code writing).
So, let's have a look at how to calculate a genomic BLUP prediction with the sommer package.
First, we need to load it.
#install.packages("sommer")
library(sommer)
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: crayon
##
## Attaching package: 'crayon'
## The following object is masked from 'package:ggplot2':
##
## %+%
##
## Attaching package: 'sommer'
## The following objects are masked from 'package:rrBLUP':
##
## A.mat, GWAS
In the sommer package, we have the option to include the dominance and epistatic relationship matrices in the prediction process. So let's go back to our simple one-environment, one-date example and check out how to run a GBLUP with - an additive relationship matrix (used before and known as kinship) - a dominance relationship matrix - an epistatic relationship matrix
We can generate all these matrices by sommer functions.
NOTE - we calculate the Kinship of the genotypes
without the Environment/Time replication (as in geno3)
K = A.mat(as.matrix(geno2))
D = D.mat(as.matrix(geno2))
E = E.mat(as.matrix(geno2))
# please note, all three of them allow to impute missing values using
# K = A.mat(as.matrix(geno2), return.imputed = T)
# imputed = K$X
# K = K$A
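For intuition, the additive relationship matrix is essentially a cross-product of the centered marker matrix, scaled by the allele-frequency variance (VanRaden 2008). The sketch below is a rough manual version; vanraden_A is a hypothetical helper, and sommer's A.mat offers additional options (imputation, weighting), so values may differ in detail.

```r
# Rough manual sketch of a VanRaden-type additive kinship from
# markers coded -1/0/1 ('vanraden_A' is a hypothetical helper).
vanraden_A = function(M) {
  p = colMeans(M + 1) / 2                            # allele frequencies
  Z = scale(M, center = 2 * p - 1, scale = FALSE)    # center each marker by 2p - 1
  tcrossprod(Z) / (2 * sum(p * (1 - p)))             # VanRaden (2008), method 1
}
set.seed(2)
M = matrix(sample(c(-1, 0, 1), 10 * 100, replace = TRUE), nrow = 10)
A = vanraden_A(M)
dim(A)   # n x n relationship matrix, one row/column per genotype
```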
RECAP - this is the data we are using
#ph = pheno[pheno$Environment== "16h" & pheno$time=="01.04.2021",c(2,5)]
head(ph)
## Genotype Relative.Chlorophyll
## 1 G1 44.34587
## 18 G2 50.55244
## 35 G3 50.61617
## 52 G4 51.82509
## 69 G5 45.99872
## 86 G6 48.83685
There is some additional preparation required to run the GBLUP: for each relationship matrix, we need to add a column with the genotype identifiers to the phenotype file
ph$id = ph$idd = ph$iD = ph$Genotype
row.names(ph) = ph$Genotype
ph = ph[,-1]
Before we run the genomic prediction, let's have a quick look at
the reference manual of the mmer function, which actually
does the GBLUP for us.
Now, let's have a look at what it does and what needs to be specified.
mmer (R Documentation)
The mmer function uses the Direct-Inversion Newton-Raphson or Average Information coded in C++ using the Armadillo library to optimize dense matrix operations common in genomic selection models. These algorithms are intended to be used for problems of the type c > r (more coefficients to estimate than records in the dataset) and/or dense matrices. For problems with sparse data, or problems of the type r > c (more records in the dataset than coefficients to estimate), the MME-based algorithm in the mmec function is faster and we recommend to shift to use that function.
mmer(fixed, random, rcov, data, weights, W, nIters=20, tolParConvLL = 1e-03,
tolParInv = 1e-06, init=NULL, constraints=NULL,method="NR", getPEV=TRUE,
naMethodX="exclude", naMethodY="exclude",returnParam=FALSE,
dateWarning=TRUE,date.warning=TRUE,verbose=TRUE, reshapeOutput=TRUE, stepWeight=NULL,
emWeight=NULL)
fixed: A formula specifying the response variable(s) and fixed effects, i.e. response ~ covariate for univariate models, or cbind(response.i,response.j) ~ covariate for multivariate models.
random: A formula specifying the name of the random effects, i.e. random= ~ genotype + year. Useful functions can be used to fit heterogeneous variances and other special models (see 'Special Functions' in the Details section for more information).
rcov: A formula specifying the name of the error term, i.e. rcov= ~ units. Special heterogeneous and special variance models and constraints for the residual part are the same used on the random term, but the name of the random effect is always "units", which can be thought of as a column with as many levels as rows in the data, i.e. rcov=~vsr(dsr(covariate),units)
data: A data frame containing the variables specified in the formulas for response, fixed, and random effects.
weights: Name of the covariate for weights. To be used for the product R = Wsi*R*Wsi, where * is the matrix product, Wsi is the square root of the inverse of W and R is the residual matrix.
W: Alternatively, instead of providing a vector of weights the user can specify an entire W matrix (e.g., when covariances exist). To be used first to produce Wis = solve(chol(W)), and then calculate R = Wsi*R*Wsi.t(), where * is the matrix product, and R is the residual matrix. Only one of the arguments weights or W should be used. If both are indicated, W will be given the preference.
nIters: Maximum number of iterations allowed.
tolParConvLL: Convergence criterion for the change in log-likelihood.
tolParInv: Tolerance parameter for matrix inverse used when singularities are encountered in the estimation procedure.
init: Initial values for the variance components. By default this is NULL and initial values for the variance components are provided by the algorithm, but in case the user wants to provide initial values for ALL var-cov components this argument is functional. It has to be provided as a list, where each list element corresponds to one random effect (1x1 matrix) and, if a multitrait model is pursued, each element of the list is a matrix of variance-covariance components among traits for such random effect. Initial values can also be provided in the Gti argument of the vsr function. It is highly encouraged to use the Gti and Gtc arguments of the vsr function instead of this argument, but this argument can be used to provide all initial values at once.
constraints: When initial values are provided these have to be accompanied by their constraints. See the vsr function for more details on the constraints. It is highly encouraged to use the Gti and Gtc arguments of the vsr function instead of this argument, but this argument can be used to provide all constraints at once.
method: This refers to the method or algorithm to be used for estimating variance components. Direct-inversion Newton-Raphson NR and Average Information AI (Tunnicliffe 1989; Gilmour et al. 1995; Lee et al. 2015).
getPEV: A TRUE/FALSE value indicating if the program should return the predicted error variance and variance for random effects. This option is provided since this can take a long time for certain models where p > n by a big extent.
naMethodX: One of the two possible values, "include" or "exclude". If "include" is selected, the function will impute the X matrices for fixed effects with the median value. If "exclude" is selected, it will get rid of all rows with missing values for the X (fixed) covariates. The default is "exclude". The "include" option should be used carefully.
naMethodY: One of the three possible values, "include", "include2" or "exclude" (default), to treat the observations in the response variable to be used in the estimation of variance components. The first option, "include", will impute the response variables for all rows with the median value, whereas "include2" imputes the responses only for rows where there is observation(s) for at least one of the responses (only available in the multi-response models). If "exclude" is selected (default), it will get rid of rows in response(s) where missing values are present for at least one of the responses.
returnParam: A TRUE/FALSE value to indicate if the program should return the parameters to be used for fitting the model instead of fitting the model.
dateWarning: A TRUE/FALSE value to indicate if the program should warn you when it is time to update the sommer package.
date.warning: A TRUE/FALSE value to indicate if the program should warn you when it is time to update the sommer package. This argument will be removed soon, just left for backward compatibility.
verbose: A TRUE/FALSE value to indicate if the program should return the progress of the iterative algorithm.
reshapeOutput: A TRUE/FALSE value to indicate if the output should be reshaped to be easier to interpret for the user; some information is missing from the multivariate models for an easy interpretation.
stepWeight: A vector of values (of length equal to the number of iterations) indicating the weight used to multiply the update (delta) for variance components at each iteration. If NULL, the 1st iteration will be multiplied by 0.5, the 2nd by 0.7, and the rest by 0.9. This argument can help to avoid that variance components go outside the parameter space in the initial iterations, which doesn't happen very often with the NR method but can be detected by looking at the behavior of the likelihood. In that case you may want to give a smaller weight to the initial 8-10 iterations.
emWeight: A vector of values (of length equal to the number of iterations), with values between 0 and 1, indicating the weight assigned to the EM information matrix. The values 1 - emWeight will be applied to the NR/AI information matrix to produce a joint information matrix.
The use of this function requires a good understanding of mixed models. Please review the 'sommer.quick.start' vignette and pay attention to details like format of your random and fixed variables (e.g. character and factor variables have different properties when returning BLUEs or BLUPs, please see the 'sommer.changes.and.faqs' vignette).
For tutorials on how to perform different analysis with sommer please look at the vignettes by typing in the terminal:
vignette("v1.sommer.quick.start")
vignette("v2.sommer.changes.and.faqs")
vignette("v3.sommer.qg")
vignette("v4.sommer.gxe")
Citation
Type citation("sommer") to know how to cite the sommer package in your publications.
Special variance structures
vsr(atr(x,levels),y)
can be used to specify heterogeneous variance for the "y" covariate at specific levels of the covariate "x", i.e. random=~vsr(at(Location,c("A","B")),ID) fits a variance component for ID at levels A and B of the covariate Location.
vsr(dsr(x),y)
can be used to specify a diagonal covariance structure for the "y" covariate for all levels of the covariate "x", i.e. random=~vsr(dsr(Location),ID) fits a variance component for ID at all levels of the covariate Location.
vsr(usr(x),y)
can be used to specify an unstructured covariance structure for the "y" covariate for all levels of the covariate "x", i.e. random=~vsr(usr(Location),ID) fits variance and covariance components for ID at all levels of the covariate Location.
vsr(overlay(...,rlist=NULL,prefix=NULL))
can be used to specify overlay of design matrices between consecutive random effects specified, i.e. random=~vsr(overlay(male,female)) overlays (overlaps) the incidence matrices for the male and female random effects to obtain a single variance component for both effects. The 'rlist' argument is a list with each element being a numeric value that multiplies the incidence matrix to be overlayed. See overlay for details.Can be combined with vsr().
vsr(leg(x,n),y)
can be used to fit a random regression model using a numerical variable x that marks the trajectory for the random effect y. The leg function can be combined with the special functions dsr, usr, at and csr. For example random=~vsr(leg(x,1),y) or random=~vsr(usr(leg(x,1)),y).
vsr(x,Gtc=fcm(v))
can be used to constrain fixed effects in the multi-response mixed models. This is a vector that specifies if the fixed effect is to be estimated for such trait. For example fixed=cbind(response.i, response.j)~vsr(Rowf, Gtc=fcm(c(1,0))) means that the fixed effect Rowf should only be estimated for the first response and the second should only have the intercept.
gvsr(x,y)
can be used to fit variance and covariance parameters between two or more random effects. For example, indirect genetic effect models.
spl2Da(x.coord, y.coord, at.var, at.levels)
can be used to fit a 2-dimensional spline (i.e. spatial modeling) using coordinates x.coord and y.coord (in numeric class) assuming a single variance component. The 2D spline can be fitted at specific levels using the at.var and at.levels arguments. For example random=~spl2Da(x.coord=Row,y.coord=Range,at.var=FIELD).
spl2Db(x.coord, y.coord, at.var, at.levels)
can be used to fit a 2-dimensional spline (i.e. spatial modeling) using coordinates x.coord and y.coord (in numeric class) assuming multiple variance components. The 2D spline can be fitted at specific levels using the at.var and at.levels arguments. For example random=~spl2Db(x.coord=Row,y.coord=Range,at.var=FIELD).
S3 methods
S3 methods are available for some parameter extraction such as fitted.mmer, residuals.mmer, summary.mmer, randef, coef.mmer, anova.mmer, plot.mmer, and predict.mmer to obtain adjusted means. In addition, the vpredict function (replacement of the pin function) can be used to estimate standard errors for linear combinations of variance components (i.e. ratios like h2).
Additional Functions
Additional functions for genetic analysis have been included such as relationship matrix building (A.mat, D.mat, E.mat, H.mat), build a genotypic hybrid marker matrix (build.HMM), plot of genetic maps (map.plot), and manhattan plots (manhattan). If you need to build a pedigree-based relationship matrix use the getA function from the pedigreemm package.
Bug report and contact
If you have any technical questions or suggestions please post it in https://stackoverflow.com or https://stats.stackexchange.com
If you have any bug report please go to https://github.com/covaruber/sommer or send me an email to address it asap, just make sure you have read the vignettes carefully before sending your question.
Example Datasets
The package has been equipped with several datasets to learn how to use the sommer package:
* DT_halfdiallel, DT_fulldiallel and DT_mohring datasets have examples to fit half and full diallel designs.
* DT_h2 to calculate heritability
* DT_cornhybrids and DT_technow datasets to perform genomic prediction in hybrid single crosses
* DT_wheat dataset to do genomic prediction in single crosses in species displaying only additive effects.
* DT_cpdata dataset to fit genomic prediction models within a biparental population coming from 2 highly heterozygous parents including additive, dominance and epistatic effects.
* DT_polyploid to fit genomic prediction and GWAS analysis in polyploids.
* DT_gryphon data contains an example of an animal model including pedigree information.
* DT_btdata dataset contains an animal (birds) model.
* DT_legendre simulated dataset for random regression model.
* DT_sleepstudy dataset to know how to translate lme4 models to sommer models.
* DT_ige dataset to show how to fit indirect genetic effect models.
Models Enabled
For details about the models enabled and more information about the covariance structures please check the help page of the package (sommer).
If all parameters are correctly indicated the program will return a list with the following information:
Vi: the inverse of the phenotypic variance matrix V^- = (ZGZ+R)^-1
P: the projection matrix Vi - [Vi*(X*Vi*X)^-*Vi]
sigma: a list with the values of the variance-covariance components, with one list element for each random effect.
sigma_scaled: a list with the values of the scaled variance-covariance components, with one list element for each random effect.
sigmaSE: Hessian matrix containing the variance-covariance for the variance components. SEs can be obtained by taking the square root of the diagonal values of the Hessian.
Beta: a data frame for trait BLUEs (fixed effects).
VarBeta: a variance-covariance matrix for trait BLUEs.
U: a list (one element for each random effect) with a data frame for trait BLUPs.
VarU: a list (one element for each random effect) with the variance-covariance matrix for trait BLUPs.
PevU: a list (one element for each random effect) with the predicted error variance matrix for trait BLUPs.
fitted: Fitted values y.hat = XB
residuals: Residual values e = Y - XB
AIC: Akaike information criterion
BIC: Bayesian information criterion
convergence: a TRUE/FALSE statement indicating if the model converged.
monitor: The values of log-likelihood and variance-covariance components across iterations during the REML estimation.
percChange: The percent change of variance components across iterations. There should be one column less than the number of iterations. Calculated as percChange = ((x_i/x_i-1) - 1) * 100, where i is the ith iteration.
dL: The vector of first derivatives of the likelihood with respect to the ith variance-covariance component.
dL2: The matrix of second derivatives of the likelihood with respect to the i,j-th variance-covariance component.
method: The method for estimation of variance components specified by the user.
call: Formula for fixed, random and rcov used.
constraints: constraints used in the mixed models for the random effects.
constraintsF: constraints used in the mixed models for the fixed effects.
data: The dataset used in the model after removing missing records for the response variable.
dataOriginal: The original dataset used in the model.
terms: The names of terms for responses, fixed, random and residual effects in the model.
termsN: The number of effects associated to fixed, random and residual effects in the model.
sigmaVector: a vectorized version of the sigma element (variance-covariance components) to easily match the standard errors of the var-cov components stored in the element sigmaSE.
reshapeOutput: The value provided to the mmer function for the argument with the same name.
Giovanny Covarrubias-Pazaran
Covarrubias-Pazaran G. Genome assisted prediction of quantitative traits using the R package sommer. PLoS ONE 2016, 11(6): doi:10.1371/journal.pone.0156744
Covarrubias-Pazaran G. 2018. Software update: Moving the R package sommer to multivariate mixed models for genome-assisted prediction. doi: https://doi.org/10.1101/354639
Bernardo Rex. 2010. Breeding for quantitative traits in plants. Second edition. Stemma Press. 390 pp.
Gilmour et al. 1995. Average Information REML: An efficient algorithm for variance parameter estimation in linear mixed models. Biometrics 51(4):1440-1450.
Kang et al. 2008. Efficient control of population structure in model organism association mapping. Genetics 178:1709-1723.
Lee, D.-J., Durban, M., and Eilers, P.H.C. (2013). Efficient two-dimensional smoothing with P-spline ANOVA mixed models and nested bases. Computational Statistics and Data Analysis, 61, 22 - 37.
Lee et al. 2015. MTG2: An efficient algorithm for multivariate linear mixed model analysis based on genomic information. Cold Spring Harbor. doi: http://dx.doi.org/10.1101/027201.
Maier et al. 2015. Joint analysis of psychiatric disorders increases accuracy of risk prediction for schizophrenia, bipolar disorder, and major depressive disorder. Am J Hum Genet; 96(2):283-294.
Rodriguez-Alvarez, Maria Xose, et al. Correcting for spatial heterogeneity in plant breeding experiments with P-splines. Spatial Statistics 23 (2018): 52-71.
Searle. 1993. Applying the EM algorithm to calculating ML and REML estimates of variance components. Paper invited for the 1993 American Statistical Association Meeting, San Francisco.
Yu et al. 2006. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Genetics 38:203-208.
Tunnicliffe W. 1989. On the use of marginal likelihood in time series model estimation. JRSS 51(1):15-27.
Zhang et al. 2010. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 42:355-360.
####=========================================####
#### For CRAN time limitations most lines in the
#### examples are silenced with one '#' mark,
#### remove them and run the examples
####=========================================####
####=========================================####
#### EXAMPLES
#### Different models with sommer
####=========================================####
data(DT_example)
DT <- DT_example
head(DT)
####=========================================####
#### Univariate homogeneous variance models ####
####=========================================####
## Compound symmetry (CS) model
ans1 <- mmer(Yield~Env,
random= ~ Name + Env:Name,
rcov= ~ units,
data=DT)
summary(ans1)
####===========================================####
#### Univariate heterogeneous variance models ####
####===========================================####
## Compound symmetry (CS) + Diagonal (DIAG) model
ans2 <- mmer(Yield~Env,
random= ~Name + vsr(dsr(Env),Name),
rcov= ~ vsr(dsr(Env),units),
data=DT)
summary(ans2)
####===========================================####
#### Univariate unstructured variance models ####
####===========================================####
ans3 <- mmer(Yield~Env,
random=~ vsr(usr(Env),Name),
rcov=~vsr(dsr(Env),units),
data=DT)
summary(ans3)
# ####==========================================####
# #### Multivariate homogeneous variance models ####
# ####==========================================####
#
# ## Multivariate Compound symmetry (CS) model
# DT$EnvName <- paste(DT$Env,DT$Name)
# ans4 <- mmer(cbind(Yield, Weight) ~ Env,
# random= ~ vsr(Name, Gtc = unsm(2)) + vsr(EnvName,Gtc = unsm(2)),
# rcov= ~ vsr(units, Gtc = unsm(2)),
# data=DT)
# summary(ans4)
#
# ####=============================================####
# #### Multivariate heterogeneous variance models ####
# ####=============================================####
#
# ## Multivariate Compound symmetry (CS) + Diagonal (DIAG) model
# ans5 <- mmer(cbind(Yield, Weight) ~ Env,
# random= ~ vsr(Name, Gtc = unsm(2)) + vsr(dsr(Env),Name, Gtc = unsm(2)),
# rcov= ~ vsr(dsr(Env),units, Gtc = unsm(2)),
# data=DT)
# summary(ans5)
#
# ####===========================================####
# #### Multivariate unstructured variance models ####
# ####===========================================####
#
# ans6 <- mmer(cbind(Yield, Weight) ~ Env,
# random= ~ vsr(usr(Env),Name, Gtc = unsm(2)),
# rcov= ~ vsr(dsr(Env),units, Gtc = unsm(2)),
# data=DT)
# summary(ans6)
#
# ####=========================================####
# ####=========================================####
# #### EXAMPLE SET 2
# #### 2 variance components
# #### one random effect with variance covariance structure
# ####=========================================####
# ####=========================================####
#
# data("DT_cpdata")
# DT <- DT_cpdata
# GT <- GT_cpdata
# MP <- MP_cpdata
# head(DT)
# GT[1:4,1:4]
# #### create the variance-covariance matrix
# A <- A.mat(GT)
# #### look at the data and fit the model
# mix1 <- mmer(Yield~1,
# random=~vsr(id, Gu=A) + Rowf,
# rcov=~units,
# data=DT)
# summary(mix1)$varcomp
# #### calculate heritability
# vpredict(mix1, h1 ~ V1/(V1+V3) )
# #### multi trait example
# mix2 <- mmer(cbind(Yield,color)~1,
# random = ~ vsr(id, Gu=A, Gtc = unsm(2)) + # unstructured at trait level
# vsr(Rowf, Gtc=diag(2)) + # diagonal structure at trait level
# vsr(Colf, Gtc=diag(2)), # diagonal structure at trait level
# rcov = ~ vsr(units, Gtc = unsm(2)), # unstructured at trait level
# data=DT)
# summary(mix2)
#
# ####=========================================####
# #### EXAMPLE SET 3
# #### comparison with lmer, install 'lme4'
# #### and run the code below
# ####=========================================####
#
# #### lmer cannot use var-cov matrices so we will not
# #### use them in this comparison example
#
# library(lme4)
# library(sommer)
# data("DT_cornhybrids")
# DT <- DT_cornhybrids
# DTi <- DTi_cornhybrids
# GT <- GT_cornhybrids
#
# fm1 <- lmer(Yield ~ Location + (1|GCA1) + (1|GCA2) + (1|SCA),
# data=DT )
# out <- mmer(Yield ~ Location,
# random = ~ GCA1 + GCA2 + SCA,
# rcov = ~ units,
# data=DT)
# summary(fm1)
# summary(out)
# ### same BLUPs for GCA1, GCA2, SCA than lme4
# plot(out$U$GCA1$Yield, ranef(fm1)$GCA1[,1])
# plot(out$U$GCA2$Yield, ranef(fm1)$GCA2[,1])
# vv=which(abs(out$U$SCA$Yield) > 0)
# plot(out$U$SCA$Yield[vv], ranef(fm1)$SCA[,1])
#
# ### a more complex model specifying which locations
# head(DT)
# out2 <- mmer(Yield ~ Location,
# random = ~ vsr(atr(Location,c("3","4")),GCA2) +
# vsr(atr(Location,c("3","4")),SCA),
# rcov = ~ vsr(dsr(Location),units),
# data=DT)
# summary(out2)
## GBLUP A
ans1 <- mmer(Relative.Chlorophyll~1,
random=~vsr(id,Gu=K),
rcov=~units, nIters=20, tolParInv= 0.2,
data=ph,verbose = FALSE) # kinship based
Za = as.vector(ans1$U$`u:id`$Relative.Chlorophyll)
beta = ans1$Beta$Estimate
# get the predicted scores
gblup1 = beta + Za
# compare to the actual values
cor(ph$Relative.Chlorophyll, gblup1)
## [1] -0.1164541
plot(ph$Relative.Chlorophyll, gblup1)
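As an aside: the GBLUP returns one effect per genotype, not per marker, but marker effects can be back-solved thanks to the RR-BLUP / GBLUP equivalence. The sketch below uses a toy marker matrix; Z and g are simulated stand-ins, not this chapter's data.

```r
# Sketch: back-solve marker effects from GBLUP genotype effects
# using the RR-BLUP / GBLUP equivalence: if K is built from Z Z',
# then marker effects a = Z' (Z Z')^-1 g reproduce g via Z a = g.
set.seed(4)
Z = matrix(sample(c(-1, 0, 1), 15 * 40, replace = TRUE), nrow = 15)  # toy markers
g = as.vector(Z %*% rnorm(40))       # stand-in for GBLUP genotype effects
a = t(Z) %*% solve(tcrossprod(Z) + diag(1e-8, 15)) %*% g
max(abs(Z %*% a - g))                # close to zero: Z a recovers g
```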
############################################################################################################################
## GBLUP A+D
ans2 <- mmer(Relative.Chlorophyll~1,
random=~vsr(id,Gu=K) + vsr(idd,Gu=D),
rcov=~units, nIters=20, tolParInv= 0.2,
data=ph,verbose = FALSE) # kinship based
Za = as.vector(ans2$U$`u:id`$Relative.Chlorophyll)
Zd = as.vector(ans2$U$`u:idd`$Relative.Chlorophyll)
beta = ans2$Beta$Estimate
# get the predicted scores for the additive and dominance matrix
gblup2 = beta + Za + Zd
# compare to the actual values
cor(ph$Relative.Chlorophyll, gblup2)
## [1] -0.1164541
plot(ph$Relative.Chlorophyll, gblup2)
## GBLUP A+D+E
ans <- mmer(Relative.Chlorophyll~1,
random=~vsr(id,Gu=K) + vsr(idd,Gu=D) + vsr(iD,Gu=E),
rcov=~units, nIters=20, tolParInv= 0.2,
data=ph,verbose = FALSE) # kinship based
## System is singular (Inf_norestrain_inv). Stopping the job. Try a bigger number of tolparinv.
# we do not have enough data to calculate the epistatic effects here
We see similar results when comparing the rrBLUP and
sommer GBLUPs.
Let's have a look at what happens if we add the environment and the time as fixed cofactors
# extract all phenotypes, but only the Relative Chlorophyll content column
ph = pheno[,c(2,5,7,8)]
ph$id = ph$idd = ph$iD = ph$Genotype
ph = ph[,-1]
## GBLUP A
ans1 <- mmer(Relative.Chlorophyll~s1+s2,
random=~vsr(id,Gu=K),
rcov=~units, nIters=20, tolParInv= 0.2,
data=ph,verbose = FALSE) # kinship based
Za = as.vector(ans1$U$`u:id`$Relative.Chlorophyll)
beta = ans1$Beta$Estimate
# beta has three values - the y-axis intercept, the corrector for s1, and for s2
intercept = beta[1]
s1 = beta[2]
s2 = beta[3]
# get the predicted scores
gblup1 = intercept + ph$s1 * s1 + ph$s2 * s2 + Za
# compare to the actual values
cor(ph$Relative.Chlorophyll, gblup1)
## [1] 0.7975595
plot(ph$Relative.Chlorophyll, gblup1)
So this correlation is actually similar to the one we have seen in
the rrBLUP package. Still, we have to keep in mind that we
did not include any testing data here. And for the ridge regression, we
had much better predictions of up to 0.77 for the training data set.
Besides, the main thing we can read from this model is whether the
genotype was grown in the 8h or 16h environment - which was not our
intention.
When we calculate the correlation within the environments, it becomes apparent that the accuracy is quite low.
ph$predict = gblup1
ph16 = ph[ph$s1 == 1,]
ph8 = ph[ph$s1 == 2,]
# calculate the correlations
cor(ph8$Relative.Chlorophyll, ph8$predict)
## [1] 0.07157482
cor(ph16$Relative.Chlorophyll, ph16$predict)
## [1] 0.06047188
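The manual subsetting above scales poorly once there are more groups; a compact alternative is to split by environment and apply cor() per group. The helper per_env_cor and the toy data frame below are illustrative only (they mirror ph's column names).

```r
# Compute observed-vs-predicted correlations for every environment
# level in one call, instead of subsetting manually per level of s1.
per_env_cor = function(df) {
  sapply(split(df, df$s1),
         function(d) cor(d$Relative.Chlorophyll, d$predict))
}
# toy data frame mirroring ph's column names (illustrative only)
set.seed(5)
toy = data.frame(s1 = rep(1:2, each = 20),
                 Relative.Chlorophyll = rnorm(40))
toy$predict = toy$Relative.Chlorophyll + rnorm(40)
per_env_cor(toy)   # named vector: one correlation per environment level
```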
These low correlation values indicate that, within each environment, our predictions are not very good. In fact, they are pretty bad.
Maybe we can increase the prediction ability a bit by adding the dominance relationship matrix.
## GBLUP A + D
ans2 <- mmer(Relative.Chlorophyll~s1+s2,
random=~vsr(id,Gu=K) + vsr(idd,Gu=D),
rcov=~units, nIters=20, tolParInv= 0.2,
data=ph,verbose = FALSE) # kinship based
Za = as.vector(ans2$U$`u:id`$Relative.Chlorophyll)
beta = ans2$Beta$Estimate
# beta has three values - the y-axis intercept, the corrector for s1, and for s2
intercept = beta[1]
s1 = beta[2]
s2 = beta[3]
# get the predicted scores
gblup1 = intercept + ph$s1 * s1 + ph$s2 * s2 + Za
ph$predict = gblup1
ph16 = ph[ph$s1 == 1,]
ph8 = ph[ph$s1 == 2,]
# calculate the correlations
cor(ph8$Relative.Chlorophyll, ph8$predict)
## [1] 0.07157482
cor(ph16$Relative.Chlorophyll, ph16$predict)
## [1] 0.06047188
plot(ph$Relative.Chlorophyll, gblup1)
We see neither an improvement nor a reduction in the prediction
ability. Still, there is a relevant amount of underfit in our
model.
This is highlighted when we color the plot according to the environment of the measurement.
ph$predicted = gblup1
ggplot(ph, aes(x=Relative.Chlorophyll, y=predicted, shape = as.factor(s1), color = s2, size= s2, label = iD)) +
geom_point() + #geom_text(hjust=0, vjust=0) +
ylab("Predicted") + xlab("Relative Chlorophyll content")
We must conclude that, beyond predicting which environment we sampled from, we cannot make any precise predictions. This became even more apparent when we calculated the correlations for both environments separately.
So now, where we so gloriously failed with the GBLUP,
lets go back to the ridge regression rrBLUP. This worked in the
rrBLUP package, so why wouldn’t it in the sommer
package?
# this is how a rrBLUP model looks like
ans2 <- mmer(Relative.Chlorophyll~s1+s2, # when you do not have any fixed effects, you simply write "1" instead of s1+s2
random=~vsr(list(geno3), buildGu = FALSE),
rcov=~units, getPEV = FALSE, nIters=20,
data=ph, verbose = FALSE) # kinship based
u <- as.matrix(geno3) %*% as.matrix(ans2$U$`u:geno3`$Relative.Chlorophyll) # BLUPs for individuals
beta = ans2$Beta$Estimate
# beta has three values - the y-axis intercept, the corrector for s1, and for s2
intercept = beta[1]
s1 = beta[2]
s2 = beta[3]
# get the predicted scores
rrblup = intercept + ph$s1 * s1 + ph$s2 * s2 + u
# compare to the actual values
cor(ph$Relative.Chlorophyll, rrblup)
## [,1]
## [1,] 0.9272857
plot(ph$Relative.Chlorophyll, rrblup)
We see the exact same result as we have previously seen.
One thing that can be done in the sommer package is
the addition of another covariance matrix. So, let's try this out in
the ridge regression and see its impact on the prediction accuracy.
# a rrBLUP model with an additive covariance matrix added
ans <- mmer(Relative.Chlorophyll~s1+s2,
random=~vsr(list(geno3), buildGu = FALSE) + vsr(id,Gu=K),
rcov=~units, getPEV = FALSE, nIters=20,
data=ph, verbose = FALSE) # kinship based
u <- as.matrix(geno3) %*% as.matrix(ans$U$`u:geno3`$Relative.Chlorophyll) # BLUPs for individuals
beta = ans$Beta$Estimate
# beta has three values - the y-axis intercept, the corrector for s1, and for s2
intercept = beta[1]
s1 = beta[2]
s2 = beta[3]
# get the predicted scores
rrblup = intercept + ph$s1 * s1 + ph$s2 * s2 + u
# compare to the actual values
cor(ph$Relative.Chlorophyll, rrblup)
## [,1]
## [1,] 0.8541736
plot(ph$Relative.Chlorophyll, rrblup)
In this particular case, we can see that there is no improvement
from using a covariance matrix. This could be for several reasons, but most
likely simply because we do not have a very strong genotype bias, but
rather a pretty strong environmental bias.
So we go back to where we came from.
# this is how a rrBLUP model looks like
ans <- mmer(Relative.Chlorophyll~s1+s2, # when you do not have any fixed effects, you simply write "1" instead of s1+s2
random=~vsr(list(geno3), buildGu = FALSE),
rcov=~units, getPEV = FALSE, nIters=20,
data=ph, verbose = FALSE)
u <- as.matrix(geno3) %*% as.matrix(ans$U$`u:geno3`$Relative.Chlorophyll) # BLUPs for individuals
beta = ans$Beta$Estimate
# beta has three values - the y-axis intercept, the corrector for s1, and for s2
intercept = beta[1]
s1 = beta[2]
s2 = beta[3]
# get the predicted scores
rrblup = intercept + ph$s1 * s1 + ph$s2 * s2 + u
# compare to the actual values
cor(ph$Relative.Chlorophyll, rrblup)
## [,1]
## [1,] 0.9272857
You might also wonder why we do not split into test and training sets. Indeed, under normal circumstances you are highly advised to do so. As we have already done this in the previous chapter and figured out that we have neither an over- nor an underfit, we simply keep all data together.
This is an aspect we have not looked at yet, but it is highly interesting to reduce the number of 21,000 markers down to maybe 200 markers. We will try to do this in one of the next chapters, but for now we will only have a look at how big the effects of the markers are. Please keep in mind - the ridge regression applied before shrinks the marker effects, so the effects we will observe are very small.
# we want to look at the physical positions and the distributions of marker effects
# extracting the marker effects
a = as.data.frame(ans$U$`u:geno3`$Relative.Chlorophyll)
# get the position, so we can make a manhattan plot
library(stringr)
a$chr = as.factor(str_split_fixed(row.names(a), "_", 2)[,1])
a$pos = as.integer(str_split_fixed(row.names(a), "_", 2)[,2])
colnames(a) = c("MarkerEffect", "Chromosome", "PhysPos")
# marker effects are multiplied by 100 for readability
manplot = ggplot(a, aes(x=PhysPos, y=MarkerEffect*100, fill=Chromosome)) + geom_point(alpha=0.75, shape=21, color="black", size=1.5) +
scale_y_continuous(limits=c(round(min(a$MarkerEffect*100)*1.2, digits = 3), round(max(a$MarkerEffect*100)*1.2, digits = 3)), expand = c(0, 0)) + theme_classic() +
theme(legend.position="none") + ylab("Effect size") + xlab("Chromosome") +
geom_hline(yintercept = c(round(min(a$MarkerEffect*100)*0.6, digits = 5), round(max(a$MarkerEffect*100)*0.6, digits = 5))) +
facet_grid(~Chromosome, scales = "free", space='free', switch="both") +
theme(panel.spacing.x=unit(0.2,"mm"), axis.ticks.x = element_blank(),
strip.background = element_rect(fill = "grey90", colour = "white", linewidth = 0), axis.text.x = element_blank(),
axis.text.y = element_text(size=12, angle=0, hjust=1, colour = "black"),
strip.text = element_text(size=10), legend.title=element_text(size=10), axis.title.y = element_text(size=12, face="bold", colour = "black")) + ggtitle("Relative chlorophyll content")
manplot
We can see that the majority of markers have a small effect. How many markers have no effect at all (marker effect == 0)?
sum(a$MarkerEffect==0)
## [1] 0
Of course, the value is 0. Why is that? Remember the penalty term of the ridge regression, lambda * slope^2: the marker effects can get arbitrarily close to 0, but never exactly 0. It's different with the lasso, where marker effects can shrink all the way down to 0.
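The difference is easy to demonstrate with a single-predictor toy example (a self-contained sketch, not part of our barley pipeline): the ridge estimate divides the least-squares numerator by sum(x^2) + lambda, while the lasso soft-thresholds it. With a penalty larger than the numerator, the lasso estimate becomes exactly 0 - the ridge estimate never does.

```r
# toy data: a single predictor with a true slope of 2
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

xy  <- sum(x * y)            # least-squares numerator
lam <- 1.5 * abs(xy)         # penalty chosen larger than |xy| on purpose

ridge <- xy / (sum(x^2) + lam)                         # shrunken, but not 0
lasso <- sign(xy) * max(abs(xy) - lam, 0) / sum(x^2)   # soft thresholding -> 0
ridge
lasso
```

The soft-thresholding line is the standard closed-form lasso solution for a single predictor; for many correlated markers the lasso has no closed form, but the zeroing behaviour is the same.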
Before we move to the next chapter, we will have a look at how to calculate the genomic heritability.
model <- mmer(Relative.Chlorophyll~s1+s2, random=~vsr(id,Gu=K), rcov=~units, nIters=100, data=ph,verbose = TRUE)
## iteration LogLik wall cpu(sec) restrained
## 1 614.758 17:49:58 1 0
## 2 648.128 17:49:59 2 0
## 3 650.354 17:50:1 4 0
## 4 650.475 17:50:2 5 0
## 5 650.476 17:50:3 6 0
## 6 650.476 17:50:4 7 0
h2 = vpredict(model, h2 ~ (V1) / ( V1+V2) )
h2
## Estimate SE
## h2 0.582443 0.07531803
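The ratio that vpredict evaluates is simply Vu / (Vu + Ve); with made-up variance components (the fitted ones can be read from summary(model)) the calculation looks like this:

```r
# hypothetical variance components, chosen to mirror the estimate above
Vu <- 5.8   # genomic variance (V1)
Ve <- 4.2   # residual variance (V2)
h2 <- Vu / (Vu + Ve)
h2   # 0.58
```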
So the genomic heritability is 0.58, which is not bad considering that a GWAS did not find any marker explaining more than 0.5% of the entire variation, and that we only have 23 genotypes tested in two environments.
We go on with the next section, where we have a look at genomic prediction approaches using Bayesian statistics: BGLR.
The BGLR package allows us to use Bayesian
implementations to predict the phenotype from genomic data. You can find
the reference here
The core function is BGLR, where all the final fitting
steps are performed. Manuals for almost all conceivable use cases can be
found on the GitHub website of BGLR. BGLR allows us to run the
predictions with various mathematical backgrounds. Besides the
GBLUP and rrBLUP we already worked with before, it also
allows us to run BayesA, BayesB, BayesC & a
Bayesian lasso. Further, we can calculate the genomic heritability
from any model.
We can have a look at the manual of the function first.
library(BGLR)
Now, lets have a look what it is doing and what needs to be specified.
BGLR (R Documentation)
The BGLR (‘Bayesian Generalized Linear Regression’) function fits various types of parametric and semi-parametric Bayesian regressions to continuous (censored or not), binary and ordinal outcomes.
BGLR(y, response_type = "gaussian", a=NULL, b=NULL,ETA = NULL, nIter = 1500,
burnIn = 500, thin = 5, saveAt = "", S0 = NULL,
df0 =5, R2 = 0.5, weights = NULL,
verbose = TRUE, rmExistingFiles = TRUE, groups=NULL)
y - (numeric) the data vector (NAs allowed).
response_type - (string) admits the values "gaussian" and "ordinal".
a, b - (numeric) lower and upper bounds, only required for censored outcomes.
ETA - (list) This is a two-level list used to specify the regression function (or linear predictor). By default the linear predictor (the conditional expectation function in case of Gaussian outcomes) includes only an intercept. Regression on covariates and other types of random effects are specified in this two-level list. For instance:
ETA=list(list(X=W, model="FIXED"),
list(X=Z,model="BL"),
list(K=G,model="RKHS")),
specifies that the linear predictor should include: an intercept (included by default), plus a linear regression on W with regression coefficients treated as fixed effects (i.e., flat prior), plus a regression on Z with regression coefficients modeled as in the Bayesian Lasso of Park and Casella (2008), plus a random effect with covariance structure G. For linear regressions the following options are implemented: FIXED (flat prior), BRR (Gaussian prior), BayesA (scaled-t prior), BL (double-exponential prior),
BayesB (two-component mixture prior with a point of mass at zero and a scaled-t slab), BayesC (two-component mixture prior with a point of
mass at zero and a Gaussian slab). In linear regressions X can be the incidence matrix for the effects or a formula.
weights - (numeric) a vector of weights.
nIter, burnIn, thin - (integer) the number of iterations, burn-in and thinning.
saveAt - (string) this may include a path and a prefix that will be added to the names of the files that are saved as the program runs.
S0, df0 - (numeric) the scale and degrees-of-freedom parameters for the scaled inverse-chi-squared prior assigned to the residual variance, only used with Gaussian outcomes.
R2 - (numeric, 0 < R2 < 1) the proportion of variance that one expects, a priori, to be explained by the regression.
verbose - (logical) if TRUE the iteration history is printed, default TRUE.
rmExistingFiles - (logical) if TRUE removes existing output files from previous runs, default TRUE.
groups - (factor) a vector of the same length as y that associates observations with groups; each group will have an associated variance component for the error term.
BGLR implements a Gibbs sampler for a Bayesian regression model. The linear predictor (or regression function) includes an intercept (introduced by default) plus a number of user-specified regression components (X) and random effects (u), that is:
η = 1μ + X1β1 + ... + Xpβp + u1 + ... + uq
The components of the linear predictor are specified in the argument ETA (see above). The user can specify as many linear terms as desired, and for each component the user can choose the prior density to be assigned. The distribution of the response is modeled as a function of the linear predictor.
For Gaussian outcomes, the linear predictor is the conditional expectation, and censoring is allowed. For censored data points the actual response value (y_i)
is missing, and the entries of the vectors a and b (see above) give the lower and upper bound for y_i. The
following table shows the configuration of the triplet (y, a, b) for un-censored, right-censored,
left-censored and interval-censored data.
|                   | a       | y   | b      |
| Un-censored       | NULL    | y_i | NULL   |
| Right censored    | a_i     | NA  | \infty |
| Left censored     | -\infty | NA  | b_i    |
| Interval censored | a_i     | NA  | b_i    |
Internally, censoring is dealt with as a missing data problem.
Ordinal outcomes are modelled using the probit link, implemented via data augmentation. In this case the linear predictor becomes the mean of the underlying liability variable, which is normal with mean equal to the linear predictor and variance equal to one. In case of only two classes (binary outcome) the threshold is set equal to zero; for more than two classes, thresholds are estimated from the data. Further details about this approach can be found in Albert and Chib (1993).
A list with estimated posterior means, estimated posterior standard deviations, and the parameters used to fit the model. See the vignettes in the package for further details.
Authors: Gustavo de los Campos and Paulino Perez Rodriguez.
Albert, J. and S. Chib. 1993. Bayesian Analysis of Binary and Polychotomous Response Data. JASA 88: 669-679.
de los Campos G., H. Naya, D. Gianola, J. Crossa, A. Legarra, E. Manfredi, K. Weigel and J. Cotes. 2009. Predicting Quantitative Traits with Regression Models for Dense Molecular Markers and Pedigree. Genetics 182: 375-385.
de los Campos, G., D. Gianola, G. J. M., Rosa, K. A., Weigel, and J. Crossa. 2010. Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genetics Research, 92:295-308.
Park T. and G. Casella. 2008. The Bayesian LASSO. Journal of the American Statistical Association 103: 681-686.
Spiegelhalter, D.J., N.G. Best, B.P. Carlin and A. van der Linde. 2002. Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, Series B (Statistical Methodology) 64 (4): 583-639.
## Not run:
#Demos
library(BGLR)
#BayesA
demo(BA)
#BayesB
demo(BB)
#Bayesian LASSO
demo(BL)
#Bayesian Ridge Regression
demo(BRR)
#BayesCpi
demo(BayesCpi)
#RKHS
demo(RKHS)
#Binary traits
demo(Bernoulli)
#Ordinal traits
demo(ordinal)
#Censored traits
demo(censored)
## End(Not run)
In the following block of code, we will figure out how all those predictions can be performed.
#install.packages("BGLR")
library(BGLR)
X<-geno3
# we generate 5 cross-validation folds, so that each genotype occurs in both the training and the test set
pheno$CrossFold = sample(factor(rep(1:5, length.out=nrow(pheno)), labels=paste0("CF", 1:5)))
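A quick, self-contained sanity check of this fold assignment (with a hypothetical row count, since the construction only depends on nrow(pheno)): sampling a rep()-ed factor guarantees fold sizes that differ by at most one observation.

```r
# stand-alone illustration of the fold assignment used above
set.seed(42)
n  <- 46   # hypothetical number of phenotype records
cf <- sample(factor(rep(1:5, length.out = n), labels = paste0("CF", 1:5)))
table(cf)                          # close-to-equal fold sizes
max(table(cf)) - min(table(cf))    # differs by at most 1
```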
######################################################################################################################################################################
### run the genomic prediction with different models and compare all to each other in a plot
phenotype = "Relative.Chlorophyll" # specifying the phenotype of interest (column name in pheno)
testGP = function(Model, phenotype, pheno, model){
# store the results here
GP = data.frame(matrix(ncol=7, nrow=0))
for(i in unique(pheno$CrossFold)){
# cut out the cross validation
ph = pheno
ph[ph$CrossFold == i,phenotype] = NA
ytst = ph[,phenotype]
tst = which(ph$CrossFold == i) # get the positions in the vector where NAs were introduced in this iteration
## run the ridge regression genomic prediction
nIter=600; burnIn=100; thin = 10 # for real analyses, increase the iterations and burn-in by at least a factor of 10
fmBRR=BGLR(y=ytst,ETA=Model, nIter=nIter,burnIn=burnIn,saveAt='', verbose=F)
if (model == "bayes"){
# get the predicted scores by calculating them ("fmBRR$yhat" gives the same results)
beta = fmBRR$ETA[[1]]$b # effects of environment and time point
a = fmBRR$ETA[[2]]$b # marker effects
u = as.matrix(geno3) %*% as.vector(a) # marker times marker effects (Za)
# fmBRR$mu is the overall y-axis intercept
pred = fmBRR$mu + ph$s1 * beta[1] + ph$s2 * beta[2] + u
}else if (model == "gblup"){
beta = fmBRR$ETA[[1]]$b
u = fmBRR$ETA[[2]]$u
# fmBRR$mu is the overall y-axis intercept
pred = fmBRR$mu + ph$s1 * beta[1] + ph$s2 * beta[2] + u
}
# correlate and plot only the relevant aspect - the test set
# separate the 8h from the 16h environment for the correlation assessment
# test
test = as.data.frame(cbind(pred[tst], pheno[pheno$CrossFold == i,]))
test16 = test[test$Environment == "16h",]
test8 = test[test$Environment == "8h",]
te1 = cor(test16[,1], test16$Relative.Chlorophyll)
te2 = cor(test8[,1], test8$Relative.Chlorophyll)
r2test = mean(c(te1, te2)) # note: mean() needs a single vector, mean(te1, te2) would silently ignore te2
# train
train = as.data.frame(cbind(pred[-tst], pheno[-tst,]))
train16 = train[train$Environment == "16h",]
train8 = train[train$Environment == "8h",]
ta1 = cor(train16[,1], train16$Relative.Chlorophyll)
ta2 = cor(train8[,1], train8$Relative.Chlorophyll)
r2train = mean(c(ta1, ta2))
# when you do not need/want to separate by any factor, go with these two lines
# r2test = cor(fmBRR$yHat[tst], pheno[pheno$CrossFold == i,phenotype])
# r2train = cor(fmBRR$yHat[-tst], pheno[-tst,phenotype])
# store all the meta information in a data frame, so that one can relate the variance of the traits to the correlation
a = var(fmBRR$yHat[-tst])
b = var(fmBRR$yHat[tst])
c = mean(fmBRR$SD.yHat) # variation for both test and training set - the lower the dispersion between both, the better
d = var(pheno[-tst, phenotype])
e = var(pheno[tst, phenotype])
GP = rbind(GP, c(r2test, r2train, a, b, c, d, e))
}
colnames(GP) = c("r2_test", "r2_train", "var_train_pred", "var_test_pred", "var_both_joined_pred", "var_pheno_train", "var_pheno_test") # note the order: yHat[-tst] and pheno[-tst,] are the training set
# export the results
return(GP)
}
# EC70 - attention!! - when the match to the fibl data was made, a reduced set of markers is used!!!
bayb = testGP(Model = list(list(X=pheno[,c(7:8)], model="FIXED"),
list(X=X,model='BayesB')), phenotype = phenotype , pheno = pheno, model="bayes")
brr = testGP(Model = list(list(X=pheno[,c(7:8)], model="FIXED"),
list(X=X, model='BRR')), phenotype = phenotype, pheno = pheno, model="bayes")
blas = testGP(Model = list(list(X=pheno[,c(7:8)], model="FIXED"),
list(X=X, model='BL')), phenotype = phenotype, pheno = pheno, model="bayes")
bayc = testGP(Model = list(list(X=pheno[,c(7:8)], model="FIXED"),
list(X=X, model='BayesC')), phenotype = phenotype, pheno = pheno, model="bayes")
baya = testGP(Model = list(list(X=pheno[,c(7:8)], model="FIXED"),
list(X=X, model='BayesA')), phenotype = phenotype, pheno = pheno, model="bayes")
K = rrBLUP::A.mat(X)
gblup = testGP(Model = list(list(X=pheno[,c(7:8)], model="FIXED"), list(model = "RKHS", K = K)), phenotype = phenotype, pheno = pheno, model="gblup")
# bring all of them in the same data frame and generate a plot which illustrates the boxplots of each approach
bayb$id = "BayesB"
baya$id = "BayesA"
bayc$id = "BayesC"
blas$id = "BayesLasso"
brr$id = "RidgeRegression"
gblup$id = "GBLUP"
gp1 = rbind(baya, bayb, bayc, blas, brr, gblup)
library(ggplot2)
p1 = ggplot(gp1, aes(x=id, y=r2_test, fill = id)) + geom_hline(yintercept=mean(gp1$r2_test), linetype="solid", color = "red", linewidth=2) +
geom_boxplot() + theme_classic() + theme(legend.position = "none") +
ylab("prediction ability") + xlab("") + ggtitle(paste(ncol(X), "Markers"))
p1
So what did we do here? We calculated the genomic values of markers
(Bayes and ridge regression) and the genotype scores (GBLUP) in different
ways, all using the BGLR function. All models have in
common that we included the environment and the time
as fixed effects in the model.
These results indicate that BayesB and BayesC performed worst. When you remember the lecture, BayesC is a combination of BayesA/B - and BayesB assumes there are only few markers with an effect.
So the least we can conclude here is that the chlorophyll content of young barley leaves is determined by many genes. There are no major QTLs to be found in this set of genotypes which would alone explain a big portion of the variance.
But one thing also jumps out - the GBLUP is actually the best model, in contrast to what we observed previously in the sommer and rrBLUP packages. Why is that? The implementation of the GBLUP is a bit different from the other two packages - and on top, we were using a lot more iterations & burn-in rounds, which allowed the model to fit the data much better. So the iterations and burn-ins of the models will have an influence on the outcome of the predictions. More is better - but will require more time.
Previously, the rrBLUP ridge regression model had shown the best predictions (r = 0.77).
In a last step, we want to calculate the genomic heritability for any of the previously fitted models. We use the GBLUP model implementation in BGLR, as done before.
# heritability calculation using the GBLUP model
fm=BGLR(y=pheno$Relative.Chlorophyll, ETA = list(list(X=pheno[,c(7:8)], model="FIXED"), list(model = "RKHS", K = K)) ,nIter=6000,burnIn=1000,verbose=F)
varU=scan('ETA_2_varU.dat') # this and the following file are created on disk
varE=scan('varE.dat')
h2= varU /(varU+varE)
plot(h2,type='o',cex=.5,col=4, main="GBLUP")
mean(h2)
## [1] 0.257827
We get the heritability across all iterations, so we also see
that each iteration has a different heritability score. When calculating the
average, we get the estimated heritability. The mean is lower compared
to the sommer package, so we see a relevant difference in how
these two packages implement the GBLUP.
Finding a suitable model to predict the phenotypes from the genotypes is a nice feature. But especially in plant breeding, where we have hundreds of (useless) progenies, it would simply be too expensive to genotype all individuals of a cross. For example, genotyping an individual with the 21,000 markers we have been working on currently costs 20€ - and 20€ x 1000 individuals becomes very costly. So our goal is now to make only marginally worse predictions by using far fewer markers than before. Our target could be fewer than 100 markers to use for the phenotype predictions.
RECAP To get there, two approaches exist - 1. bottom-up and 2. top-down.
We will have a look at the bottom-up method, as there is a nice tool provided, called GMStool. In contrast to the previous packages we used (e.g. sommer), GMStool is pure R code, but will be executed from the terminal. Luckily, we can run terminal code from the R console as well. You can find further information on the application here
Nevertheless, to run the app, we have to create some files, as
indicated in the illustration below. All this
data has to be stored on disk, so it first has to be
written to files, and afterwards you can run the app.
So the work pipeline looks something like this:
prepare the data
run a GWAS to get the p values of each marker
run the code from the shell (system() command)
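Step 3 works because base R can hand commands to the shell via system(); a minimal, tool-independent illustration (the actual GMStool invocation is documented in its README and is not reproduced here):

```r
# system() passes a command to the shell; intern = TRUE returns its
# standard output as a character vector instead of printing it
out <- system("echo hello from the shell", intern = TRUE)
out
```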
Nevertheless, there is a major drawback coming with this tool -
at least for us: we cannot include fixed co-factors, like the
environment and the time.
That's why we will perform a top-down approach, which allows
us to use any of the models we have used so far. For the sake of speed and
accuracy, we will perform this step using the rrBLUP ridge
regression approach. As we have seen before, the Bayes methods did not
outperform the ridge regression approach - but were
significantly slower.
Please note - the marker selection step is not suitable for GBLUP, as the relationship matrix might be altered dramatically by these major reduction steps and the variance of the prediction may become very big.
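This warning can be made tangible with a small simulation (a sketch using the simple cross-product form G = ZZ'/m in place of rrBLUP's A.mat, and made-up marker counts): with only 45 markers, the realized relationship matrix is a much noisier estimate of the relatedness between lines, which inflates the variance of GBLUP predictions.

```r
# simulate 23 unrelated lines with many vs. few biallelic (-1/1) markers
set.seed(7)
n <- 23
Z <- matrix(sample(c(-1, 1), n * 5000, replace = TRUE), nrow = n)
G_full <- tcrossprod(Z) / ncol(Z)        # relationship from 5000 markers
G_sub  <- tcrossprod(Z[, 1:45]) / 45     # same lines, only 45 markers
# off-diagonal spread = noise around the true relatedness of 0;
# it is roughly 1/sqrt(m), so ~10x larger with 45 markers
sd(G_full[lower.tri(G_full)])
sd(G_sub[lower.tri(G_sub)])
```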
So we start with the same step we previously performed
##### 1. Genomic prediction with all markers
As we did not observe any relevant over- or underfit in the parental lines tested, we will simply use all genotypes at once and will not split into a training and a testing set.
# run the GP with fixed co variables
SE = mixed.solve(pheno$Relative.Chlorophyll, Z = geno3, X = pheno[,c(7,8)])
# what is the prediction accuracy, when we use all markers?
marker_effect = SE$u # center around 0
beta = SE$beta # baseline to which the blup values differ
predicted_result = pheno$s1 * beta[1] + pheno$s2 * beta[2] + as.matrix(geno3) %*% marker_effect
# we check the accuracy for each environment separately
test = cbind(predicted_result, pheno)
t16 = test[test$Environment=="16h",]
t8 = test[test$Environment=="8h" ,]
baseline16 = cor(t16$Relative.Chlorophyll, t16$predicted_result, use = "complete")
baseline8 = cor(t8$Relative.Chlorophyll, t8$predicted_result, use = "complete")
mean(c(baseline8, baseline16))
## [1] 0.6610809
# get the marker values
u = as.data.frame(SE$u)
u$Chromosome = str_split_fixed(row.names(u), "_", 2)[,1]
u$PhysPos = as.integer(str_split_fixed(row.names(u), "_", 2)[,2])
colnames(u)[1] = "MarkerEffect"
# center the marker effects around 0
u$MarkerEffect = u$MarkerEffect - mean(u$MarkerEffect)
# plot the marker values on the genome with a customized cut off value
manplot = ggplot(u, aes(x=PhysPos, y=MarkerEffect*100, fill=Chromosome)) +
geom_point(alpha=0.75, shape=21, color="black",size=1.5) +
scale_y_continuous(limits=c(round(min(u$MarkerEffect*100)*1.2, digits = 3),
round(max(u$MarkerEffect*100)*1.2, digits = 3)),
expand = c(0, 0)) + theme_classic() +
theme(legend.position="none") + ylab("Effect size") + xlab("Chromosome") +
geom_hline(yintercept = c(round(min(u$MarkerEffect*100)*0.7, digits = 5),
round(max(u$MarkerEffect*100)*0.7, digits = 5))) +
facet_grid (~Chromosome,scales = "free", space='free',switch="both") +
theme(panel.spacing.x=unit(0.2,"mm"), axis.ticks.x = element_blank(),
strip.background = element_rect(fill = "grey90", colour = "white", size =0),
axis.text.x = element_blank(),
axis.text.y = element_text(size=12,angle=0,hjust=1, colour = "black"),
strip.text = element_text(size=10), legend.title=element_text(size=10),
axis.title.y = element_text(size=12, face="bold", colour = "black")) +
ggtitle("Relative chlorophyll content")
manplot
So for the record, when we use all 21,000 markers, the fitted model reaches a Pearson correlation of 0.66 (mean of both environments). We will see after the marker reduction how this value changes.
We will only consider those markers which have a relatively high effect. Therefore, we select the ones with an effect of at least 90% of the highest observed marker effect.
# we deselect all markers with an absolute effect below 90% of the strongest positive/negative effect
u2 = u[u$MarkerEffect < min(u$MarkerEffect)*0.9 | u$MarkerEffect > max(u$MarkerEffect)*0.9,]
There are 45 markers left which have shown either a high positive or a high negative impact on the phenotype. Please note, we centered the marker effects prior to this step, so that we picked the markers at an equal distance from the mean of all marker effects (which was negative before centering).
##### 3. Run another round of genomic prediction
This time, we only use these 45 markers and check how well they predict the phenotypes
# extract the markers from the genotype file
gsubset = geno3[,colnames(geno3) %in% row.names(u2)]
## test these marker in the genomic prediction model
prediction = mixed.solve(pheno$Relative.Chlorophyll, Z = gsubset, X = pheno[,c(7,8)])
marker_effect = prediction$u # center around 0
beta = prediction$beta # baseline to which the blup values differ
predicted_result = pheno$s1 * beta[1] + pheno$s2 * beta[2] + as.matrix(gsubset) %*% marker_effect
# we check the accuracy for each environment separately
test = cbind(predicted_result, pheno)
t16 = test[test$Environment=="16h",]
t8 = test[test$Environment=="8h" ,]
baseline16 = cor(t16$Relative.Chlorophyll, t16$predicted_result, use = "complete")
baseline8 = cor(t8$Relative.Chlorophyll, t8$predicted_result, use = "complete")
mean(c(baseline8, baseline16))
## [1] 0.3278466
The model was much faster; however, with only these 45 markers the mean prediction accuracy shown above dropped to 0.33, compared to 0.66 with the full marker set.
To verify that the selected markers are not worse than a combination
selected by chance, we will sample 45 random markers and run a genomic
prediction on these 100 times. Afterwards, we check the stats and test
whether our previous selection was better or worse than a random set.
## run the procedure 100 times and collect the correlation values
nam = row.names(u2) # marker names
sink = numeric()
for (i in c(1:100)){
sam = sample(c(1:ncol(geno3)), length(nam)) # sample as many random markers as we selected before (45)
# generate a subset of the geno data
gsubset = geno3[, sam]
## test these marker in the genomic prediction model
prediction = mixed.solve(y = pheno$Relative.Chlorophyll, Z = gsubset, X = pheno[,c(7,8)])
marker_effect = prediction$u # center around 0
beta = prediction$beta # baseline to which the blup values differ
predicted_result = pheno$s1 * beta[1] + pheno$s2 * beta[2] + as.matrix(gsubset) %*% marker_effect
test = cbind(predicted_result, pheno)
t16 = test[test$Environment=="16h",]
t8 = test[test$Environment=="8h" ,]
baseline16 = cor(t16$Relative.Chlorophyll, t16$predicted_result, use = "complete")
baseline8 = cor(t8$Relative.Chlorophyll, t8$predicted_result, use = "complete")
a = mean(c(baseline8, baseline16))
sink = c(sink, a)
}
summary(sink)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.05819 0.18201 0.29073 0.27160 0.36081 0.63106
We can see that, on average, the random marker sets perform worse than our selection - although, judging from the summary, single random sets can by chance come close to or even exceed it.
So to conclude, we reduced the marker count from 21,000 to 45 while retaining a substantial part of the prediction ability. That implies we can predict the phenotype with only about 0.2% of all markers.
We will do a final validation of these markers - with a test and a training set. Only these 45 markers are considered, and we will figure out how big the variance in a test set will be.
gsubset = geno3[,colnames(geno3) %in% row.names(u2)]
# generate a matrix where we store the correlations for each fold
GP = data.frame(matrix(ncol=3, nrow=0)) # check the test and train correlations separately
# we run a loop across all 5 folds
for(i in unique(pheno$CrossFold)){
# split the pheno in a test and a training set
train_pheno = pheno[pheno$CrossFold != i,]
test_pheno = pheno[pheno$CrossFold == i,]
# split the genotypes in a test and training data set
train_geno = gsubset[pheno$CrossFold != i,]
test_geno = gsubset[pheno$CrossFold == i,]
# run the genomic prediction with the ridge regression method
SE = mixed.solve(train_pheno$Relative.Chlorophyll, Z = train_geno, X = train_pheno[,c(7,8)])
# extract the marker effects as before
u = as.vector(SE$u) # similar for the test and training set
# use this to generate a genomic blup value for each genotype
blup_test = as.matrix(test_geno) %*% u
blup_train = as.matrix(train_geno) %*% u
# construct an estimate for the environment and the time as well
env_train = train_pheno[,7] * SE$beta[1]
time_train = train_pheno[,8] * SE$beta[2]
env_test = test_pheno[,7] * SE$beta[1]
time_test = test_pheno[,8] * SE$beta[2]
# add the fixed effects environment and time to the blup value, to obtain an actual estimate of the chlorophyll content
se_estimate_train = env_train + time_train + as.vector(blup_train)
se_estimate_test = env_test + time_test + as.vector(blup_test)
# separate the two environments from each other
# training
train = cbind(se_estimate_train, train_pheno)
t16 = train[train$Environment=="16h",]
t8 = train[train$Environment=="8h" ,]
baseline16 = cor(t16$Relative.Chlorophyll, t16$se_estimate_train, use = "complete")
baseline8 = cor(t8$Relative.Chlorophyll, t8$se_estimate_train, use = "complete")
a = mean(c(baseline8, baseline16))
# test
test = cbind(se_estimate_test, test_pheno)
t16 = test[test$Environment=="16h",]
t8 = test[test$Environment=="8h" ,]
baseline16 = cor(t16$Relative.Chlorophyll, t16$se_estimate_test, use = "complete")
baseline8 = cor(t8$Relative.Chlorophyll, t8$se_estimate_test, use = "complete")
b = mean(c(baseline8, baseline16))
GP = rbind(GP,c(a,b, i))
}
colnames(GP) = c("Train", "Test", "Fold")
GP = GP[order(GP$Fold),]
GP$Train = round(as.numeric(GP$Train), digits = 3)
GP$Test = round(as.numeric(GP$Test), digits = 3)
GP
## Train Test Fold
## 5 0.342 0.248 CF1
## 1 0.351 0.201 CF2
## 4 0.329 0.349 CF3
## 3 0.317 0.363 CF4
## 2 0.306 0.440 CF5
We summarize: 45 markers are enough to achieve a comparable prediction accuracy for the chlorophyll content in this set of 23 genotypes.
In a similar experiment, the same phenotypes (not identical - instead of relative chlorophyll, SPAD was measured) were recorded for 253 progenies of 6 parents. The goal now is to use the ridge regression model trained before to estimate the chlorophyll content of the progeny lines.
# load the genotype and phenotype data of the progenies
pro_pheno = read.csv("/Data/michael/data/task/Adjusted_means_daywise_SPAD.csv")
pro_geno = read.csv("/Data/michael/data/task/genotypes_progenies.csv", sep=" ", na.strings="")
# convert the letters to integers, as we did it before
# using the "ref" vector we defined before
# convert the coding to ref = -1 and alt = 1
for (i in c(1:ncol(pro_geno))){
pro_geno[pro_geno[,i] != ref,i] = "1"
pro_geno[pro_geno[,i] == ref,i] = "-1"
}
# convert the missing values to NA
for (i in 1:ncol(pro_geno)){pro_geno[,i] = as.integer(pro_geno[,i])}
# transpose
pro_geno = as.data.frame(t(pro_geno))
# match the geno and pheno genotypes
pro_geno = pro_geno[row.names(pro_geno) %in% pro_pheno$Lotcode2,]
# add entries per date / env
pro_geno2 = do.call(rbind, replicate(38, pro_geno, simplify=FALSE)) # 38 results from days * env
# convert the pheno data - environment and time
pro_pheno$s1 = as.integer(as.factor(pro_pheno$Env))
pro_pheno$date = as.Date(pro_pheno$date, format = "%Y-%m-%d")
pro_pheno$s2 = as.integer(as.factor(pro_pheno$date))
# using the parental fitted data to predict the progenies
SE = mixed.solve(pheno$Relative.Chlorophyll, Z = geno3, X = pheno[,c(7,8)])
# apply it to the progenies
Za = as.matrix(pro_geno2) %*% SE$u
predicted = SE$beta[1] * pro_pheno$s1 + SE$beta[2] * pro_pheno$s2 + Za[,1]
# split the environments
SP = cbind(predicted, pro_pheno)
t16 = SP[SP$Env=="16h",]
t8 = SP[SP$Env=="8h" ,]
baseline16 = cor(t16$SPAD, t16$predicted, use = "complete")
baseline8 = cor(t8$SPAD, t8$predicted, use = "complete")
mean(c(baseline8, baseline16))
## [1] 0.07879276
# make a plot
pro_pheno$predicted = predicted
ggplot(pro_pheno, aes(x=SPAD, y=predicted, shape = as.factor(s1), color = s2,
size= s2)) +
geom_point() + #geom_text(hjust=0, vjust=0) +
ylab("Predicted") + xlab("SPAD")
## Warning: Removed 133 rows containing missing values (`geom_point()`).
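As an aside, the do.call(rbind, replicate(...)) idiom used above to build pro_geno2 simply stacks the same data frame n times, so every genotype row appears once per day x environment combination. A tiny self-contained illustration:

```r
# two genotypes stacked 3 times (38 times in the real data)
m  <- data.frame(snp1 = c(-1, 1), row.names = c("G1", "G2"))
m3 <- do.call(rbind, replicate(3, m, simplify = FALSE))
nrow(m3)        # 6: each genotype row appears three times
row.names(m3)   # row names are made unique automatically
```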
Looking at this data, we must conclude that the prediction ability is bad. Why is that? This can have multiple reasons, for example the training set being genetically too narrow. We could validate this by reverse-testing the genomic selection: we only need to train the GS model on the progenies and then apply it to the parental lines.
As the data set of the progenies is 12 times bigger, this will take a lot of time. Additionally, it is quite unlikely that the genomic variance of the parents is lower than in the tested progenies.
Anyway, this is the code:
# SE = mixed.solve(pro_pheno$SPAD, Z = pro_geno2, X = pro_pheno[,c(7,8)])
#
# # apply it to the progenies
# Za = as.matrix(geno3) %*% SE$u
#
# predicted = SE$beta[1] * pheno$s1 + SE$beta[2] * pheno$s2 + Za[,1]
#
# # split the environments (the reverse test predicts the parents,
# # so we bind the predictions to the parental pheno data)
# SP = cbind(predicted, pheno)
# t16 = SP[SP$Environment=="16h",]
# t8 = SP[SP$Environment=="8h" ,]
#
# baseline16 = cor(t16$Relative.Chlorophyll, t16$predicted, use = "complete")
# baseline8 = cor(t8$Relative.Chlorophyll, t8$predicted, use = "complete")
#
# mean(c(baseline8, baseline16))
#
# # make a plot
# pheno$predicted = predicted
#
# ggplot(pheno, aes(x=Relative.Chlorophyll, y=predicted, shape = as.factor(s1), color = s2,
# size= s2)) +
# geom_point() + #geom_text(hjust=0, vjust=0) +
# ylab("Predicted") + xlab("Relative Chlorophyll content")
The most likely reason for the underfit lies in the phenotype data and the data assessment. The device the measurements were taken with tends to produce different values on repeated measurements. The parents were measured in 6 replications, while the progeny lines were measured with 2 to 3 replications. So the chance of having a wrongly adjusted mean is much higher for the progenies than for the parents - which might cause the deviation in the prediction ability.
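The replication argument can be made concrete: the standard error of a genotype mean scales with 1/sqrt(number of replicates), so the progeny means are inherently noisier than the parental means. A sketch with a made-up measurement standard deviation:

```r
# assume the device has a repeatability sd of 3 SPAD units (made-up value)
s <- 3
se_parents <- s / sqrt(6)   # parents: 6 replications
se_progeny <- s / sqrt(2)   # progenies: 2 replications
round(c(parents = se_parents, progeny = se_progeny), 2)  # 1.22 vs 2.12
```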
So we can say - garbage in, garbage out. Nevertheless, there is
another phenotype - the plant height - which might be easier to measure
than the chlorophyll content. How does it behave here? Can the genomic
prediction, trained on the parental lines, be used to predict the
progeny lines' plant height?